Class TikaLinkExtractor

  • All Implemented Interfaces:
    ILinkExtractor, IXMLConfigurable

    public class TikaLinkExtractor
    extends AbstractLinkExtractor

    ILinkExtractor implementation that uses Apache Tika to extract URLs from HTML documents. It is an alternative to HtmlLinkExtractor.

    Content-type configuration, referrer data storage, and the options for ignoring "nofollow" and ignoring link data behave the same as in HtmlLinkExtractor. When link data is kept, this parser retains only a predefined set of link attributes, when available: title, type, uri, text, and rel.
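
    The attribute restriction described above can be pictured with a small, self-contained sketch. It is only an illustration under stated assumptions: the class LinkDataFilterSketch and the method keepKnownAttributes are hypothetical names, not part of the Norconex or Tika APIs, and the actual extractor may apply the restriction differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Norconex or Tika APIs.
class LinkDataFilterSketch {

    // Attribute names listed in the class description.
    static final List<String> KEPT_ATTRIBUTES =
            List.of("title", "type", "uri", "text", "rel");

    // Returns a new map holding only the listed attributes that are
    // present in the input with a non-null value.
    static Map<String, String> keepKnownAttributes(Map<String, String> raw) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (String name : KEPT_ATTRIBUTES) {
            String value = raw.get(name);
            if (value != null) {
                kept.put(name, value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("uri", "http://example.com/page.html");
        raw.put("text", "Example page");
        raw.put("rel", "nofollow");
        raw.put("class", "external"); // not in the kept set, so it is dropped
        System.out.println(keepKnownAttributes(raw));
    }
}
```

    Attributes that are absent are skipped rather than stored as empty values, which mirrors the "when available" wording.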

    XML configuration usage:

    
    <extractor
        class="com.norconex.collector.http.link.impl.TikaLinkExtractor"
        ignoreNofollow="[false|true]"/>
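
    To make the shape of this fragment concrete, the sketch below reads the ignoreNofollow attribute with the JDK's DOM parser. It is not how the collector loads its configuration (the collector does that itself through IXMLConfigurable); the class ExtractorConfigSketch and the method readIgnoreNofollow are hypothetical, and treating a missing attribute as false is an assumption based on "false" being listed first above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

// Hypothetical illustration only; not the collector's own loading code.
class ExtractorConfigSketch {

    // Parses an <extractor .../> fragment and returns the value of its
    // ignoreNofollow attribute. Assumption: a missing attribute means false.
    static boolean readIgnoreNofollow(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        // getAttribute returns "" when the attribute is absent,
        // and Boolean.parseBoolean("") is false.
        return Boolean.parseBoolean(root.getAttribute("ignoreNofollow"));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<extractor "
                + "class=\"com.norconex.collector.http.link.impl.TikaLinkExtractor\" "
                + "ignoreNofollow=\"true\"/>";
        System.out.println(readIgnoreNofollow(xml));
    }
}
```
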
    Author:
    Pascal Essiembre
    See Also:
    HtmlLinkExtractor
    • Constructor Detail

      • TikaLinkExtractor

        public TikaLinkExtractor()