Class TikaLinkExtractor

java.lang.Object
com.norconex.collector.http.link.AbstractLinkExtractor
com.norconex.collector.http.link.impl.TikaLinkExtractor
All Implemented Interfaces:
ILinkExtractor, IXMLConfigurable

public class TikaLinkExtractor extends AbstractLinkExtractor

Implementation of ILinkExtractor that uses Apache Tika to extract URLs from HTML documents. This is an alternative to the HtmlLinkExtractor.

Content types, referrer data storage, "nofollow" handling, and ignoring link data are configured the same way as in HtmlLinkExtractor. For link data, this parser keeps only a predefined set of link attributes, when available (title, type, uri, text, rel).
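
The attribute restriction described above can be pictured with a minimal, self-contained sketch. It is not the extractor's actual code: the class LinkDataFilterSketch and the method keepKnown are hypothetical names introduced only for illustration, and the sketch assumes link data arrives as a plain map of attribute names to values. Only the five attribute names come from the description.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of the Norconex API.
public class LinkDataFilterSketch {

    // Attribute names listed in the class description.
    static final List<String> KEPT =
            List.of("title", "type", "uri", "text", "rel");

    // Returns a copy of the raw link data holding only the predefined
    // attributes that are actually present ("when available").
    static Map<String, String> keepKnown(Map<String, String> raw) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (String name : KEPT) {
            String value = raw.get(name);
            if (value != null) {
                kept.put(name, value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "class" is not in the predefined set, so it is dropped.
        Map<String, String> raw = Map.of(
                "uri", "http://example.com/page",
                "text", "Example",
                "class", "nav-link",
                "rel", "nofollow");
        System.out.println(keepKnown(raw));
    }
}
```

In this sketch, attributes outside the predefined set are silently discarded and missing ones are simply absent from the result, which mirrors the "when available" wording above.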

XML configuration usage:


<extractor
    class="com.norconex.collector.http.link.impl.TikaLinkExtractor"
    ignoreNofollow="[false|true]"/>
Author:
Pascal Essiembre
  • Constructor Details

    • TikaLinkExtractor

      public TikaLinkExtractor()
  • Method Details