Class AbstractTextLinkExtractor

java.lang.Object
com.norconex.collector.http.link.AbstractLinkExtractor
com.norconex.collector.http.link.AbstractTextLinkExtractor
All Implemented Interfaces:
ILinkExtractor, IXMLConfigurable
Direct Known Subclasses:
DOMLinkExtractor, HtmlLinkExtractor, RegexLinkExtractor, XMLFeedLinkExtractor

public abstract class AbstractTextLinkExtractor extends AbstractLinkExtractor

Base class for link extraction from text documents. It provides common configuration settings, such as restricting extraction to specific documents only and specifying one or more metadata fields from which to read the text used for extracting links.

Not suitable for binary files.

Subclasses inherit the following:

XML configuration usage:


<fieldMatcher>
  (optional expression matching the fields used for link extraction
   instead of the document stream)
</fieldMatcher>

XML usage example:


The above will apply to any content type starting with "text/".

Since:
3.0.0
Author:
Pascal Essiembre
  • Constructor Details

    • AbstractTextLinkExtractor

      public AbstractTextLinkExtractor()
  • Method Details

    • extractLinks

      public final void extractLinks(Set<Link> links, CrawlDoc doc) throws IOException
      Specified by:
      extractLinks in class AbstractLinkExtractor
      Throws:
      IOException
    • extractTextLinks

      public abstract void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) throws IOException
      Throws:
      IOException
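
The signatures above suggest a template-method arrangement: the final extractLinks entry point prepares the text, and subclasses only consume a Reader in extractTextLinks. The following is a minimal, self-contained sketch of that pattern, not the Norconex implementation. SketchTextExtractor and HrefSketchExtractor are hypothetical names, and plain String values stand in for Link, CrawlDoc, and HandlerDoc so that the example runs without the library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the base class (assumption, not Norconex code):
// it owns the entry point and delegates text parsing to subclasses.
abstract class SketchTextExtractor {

    // Final entry point, mirroring the role of extractLinks(...).
    public final void extractLinks(Set<String> links, String content)
            throws IOException {
        try (Reader reader = new StringReader(content)) {
            extractTextLinks(links, reader);
        }
    }

    // Hook for subclasses, mirroring the role of extractTextLinks(...).
    public abstract void extractTextLinks(Set<String> links, Reader reader)
            throws IOException;
}

// Hypothetical subclass: collects href="..." values, one line at a time.
class HrefSketchExtractor extends SketchTextExtractor {

    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    @Override
    public void extractTextLinks(Set<String> links, Reader reader)
            throws IOException {
        BufferedReader br = new BufferedReader(reader);
        String line;
        while ((line = br.readLine()) != null) {
            Matcher m = HREF.matcher(line);
            while (m.find()) {
                links.add(m.group(1));
            }
        }
    }
}
```

Because the entry point is final in this sketch, a subclass cannot bypass the shared preparation step; it can only change how text is turned into links.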
    • getFieldMatcher

      public TextMatcher getFieldMatcher()
      Gets the field matcher identifying the fields whose content is used for link extraction. The default is null, in which case the document content stream is used instead.
      Returns:
      field matcher
    • setFieldMatcher

      public void setFieldMatcher(TextMatcher fieldMatcher)
      Sets the field matcher identifying the fields whose content is used for link extraction. The default is null, in which case the document content stream is used instead.
      Parameters:
      fieldMatcher - field matcher
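
The choice between the content stream and metadata fields, as described for the field matcher, can be illustrated with a small self-contained sketch. It is an assumption about the general behavior, not the library's code: FieldSourceSketch is a hypothetical helper, a java.util.function.Predicate stands in for TextMatcher, and a plain Map stands in for the document metadata.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical helper (not part of Norconex): picks the text sources
// a text link extractor would read from.
final class FieldSourceSketch {

    private FieldSourceSketch() {
    }

    // fieldMatcher == null -> use the document content.
    // fieldMatcher != null -> use the values of every matching field.
    static List<String> textSources(
            Predicate<String> fieldMatcher,
            Map<String, List<String>> metadata,
            String content) {
        List<String> sources = new ArrayList<>();
        if (fieldMatcher == null) {
            sources.add(content);
            return sources;
        }
        for (Map.Entry<String, List<String>> en : metadata.entrySet()) {
            if (fieldMatcher.test(en.getKey())) {
                sources.addAll(en.getValue());
            }
        }
        return sources;
    }
}
```

With a null matcher the sketch returns only the content; with a matcher it ignores the content entirely, which mirrors the "instead of the document stream" wording in the XML usage.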
    • loadLinkExtractorFromXML

      public final void loadLinkExtractorFromXML(XML xml)
      Description copied from class: AbstractLinkExtractor
      Loads configuration settings specific to the implementing class.
      Specified by:
      loadLinkExtractorFromXML in class AbstractLinkExtractor
      Parameters:
      xml - XML configuration
    • loadTextLinkExtractorFromXML

      protected abstract void loadTextLinkExtractorFromXML(XML xml)
      Loads configuration settings specific to the implementing class.
      Parameters:
      xml - XML configuration
    • saveLinkExtractorToXML

      protected final void saveLinkExtractorToXML(XML xml)
      Description copied from class: AbstractLinkExtractor
      Saves configuration settings specific to the implementing class.
      Specified by:
      saveLinkExtractorToXML in class AbstractLinkExtractor
      Parameters:
      xml - the XML
    • saveTextLinkExtractorToXML

      protected abstract void saveTextLinkExtractorToXML(XML xml)
      Saves configuration settings specific to the implementing class.
      Parameters:
      xml - the XML
    • equals

      public boolean equals(Object other)
      Overrides:
      equals in class AbstractLinkExtractor
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class AbstractLinkExtractor
    • toString

      public String toString()
      Overrides:
      toString in class AbstractLinkExtractor