Class AbstractTextLinkExtractor

  • All Implemented Interfaces:
    ILinkExtractor, IXMLConfigurable
    Direct Known Subclasses:
    DOMLinkExtractor, HtmlLinkExtractor, RegexLinkExtractor, XMLFeedLinkExtractor

    public abstract class AbstractTextLinkExtractor
    extends AbstractLinkExtractor

    Base class for link extraction from text documents. It provides common configuration settings, such as restricting extraction to specific documents and specifying one or more metadata fields from which to read the text used for link extraction.

    Not suitable for binary files.

    Subclasses inherit the following:

    XML configuration usage:

    
    <fieldMatcher>
      (optional expression matching the fields to use for link extraction
       instead of the document stream)
    </fieldMatcher>

    XML usage example:

    
    

    The above will apply to any content type starting with "text/".

    Since:
    3.0.0
    Author:
    Pascal Essiembre
    • Constructor Detail

      • AbstractTextLinkExtractor

        public AbstractTextLinkExtractor()
    • Method Detail

      • getFieldMatcher

        public TextMatcher getFieldMatcher()
        Gets the field matcher identifying the fields whose content is used for link extraction. Default is null, in which case the document content stream is used instead.
        Returns:
        field matcher
      • setFieldMatcher

        public void setFieldMatcher(TextMatcher fieldMatcher)
        Sets the field matcher identifying the fields whose content is used for link extraction. Default is null, in which case the document content stream is used instead.
        Parameters:
        fieldMatcher - field matcher
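
        The choice between metadata fields and the content stream described above can be pictured with a minimal, self-contained sketch. It does not use the Norconex API: a Predicate<String> stands in for TextMatcher, a single-valued Map stands in for document metadata, and the class and method names (FieldMatcherSketch, textSources) are hypothetical. It only illustrates the documented rule that a null field matcher falls back to the document content.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in, NOT the Norconex API: a Predicate<String> plays the
// role of TextMatcher and a Map plays the role of document metadata.
public class FieldMatcherSketch {

    /** Returns the pieces of text that would be scanned for links. */
    static List<String> textSources(
            Predicate<String> fieldMatcher, // null mimics the documented default
            Map<String, String> metadata,
            String content) {
        List<String> sources = new ArrayList<>();
        if (fieldMatcher == null) {
            // No field matcher: use the document content instead.
            sources.add(content);
            return sources;
        }
        // Field matcher set: use the values of the matching fields only.
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            if (fieldMatcher.test(entry.getKey())) {
                sources.add(entry.getValue());
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "Home");
        metadata.put("description", "see http://example.com/a");

        System.out.println(textSources(null, metadata, "body text"));
        System.out.println(
                textSources(name -> name.startsWith("desc"), metadata, "body text"));
    }
}
```

        With the real classes, the equivalent step would be passing a TextMatcher instance to setFieldMatcher on a concrete subclass; how that matcher is built depends on the TextMatcher API and is not shown here.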
      • loadTextLinkExtractorFromXML

        protected abstract void loadTextLinkExtractorFromXML(XML xml)
        Loads configuration settings specific to the implementing class.
        Parameters:
        xml - XML configuration
      • saveTextLinkExtractorToXML

        protected abstract void saveTextLinkExtractorToXML(XML xml)
        Saves configuration settings specific to the implementing class.
        Parameters:
        xml - the XML
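
        The two abstract methods above follow a hook pattern: the base class owns the shared settings and each subclass loads and saves only its own. The sketch below models that idea with stand-ins and is not the Norconex implementation. A Map<String, String> replaces the XML object, and BaseExtractor, MaxLinksExtractor and the maxLinks setting are hypothetical. That the base class invokes the hooks from its own load and save routines, and in this order, is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins, NOT the Norconex API: a Map<String, String> plays
// the role of the XML object and the base class only mimics the hook pattern.
public class HookSketch {

    abstract static class BaseExtractor {
        private String fieldMatcher; // shared setting, null by default

        // Shared settings are handled here, then the subclass hook runs.
        final void load(Map<String, String> xml) {
            fieldMatcher = xml.get("fieldMatcher");
            loadSpecific(xml);
        }

        final void save(Map<String, String> xml) {
            if (fieldMatcher != null) {
                xml.put("fieldMatcher", fieldMatcher);
            }
            saveSpecific(xml);
        }

        // Counterparts of the load/save hooks a subclass must implement.
        protected abstract void loadSpecific(Map<String, String> xml);
        protected abstract void saveSpecific(Map<String, String> xml);
    }

    static class MaxLinksExtractor extends BaseExtractor {
        private int maxLinks = 100; // hypothetical subclass-specific setting

        @Override
        protected void loadSpecific(Map<String, String> xml) {
            maxLinks = Integer.parseInt(
                    xml.getOrDefault("maxLinks", String.valueOf(maxLinks)));
        }

        @Override
        protected void saveSpecific(Map<String, String> xml) {
            xml.put("maxLinks", String.valueOf(maxLinks));
        }
    }

    /** Loads settings into a fresh extractor and saves them back out. */
    static Map<String, String> roundTrip(Map<String, String> in) {
        MaxLinksExtractor extractor = new MaxLinksExtractor();
        extractor.load(in);
        Map<String, String> out = new HashMap<>();
        extractor.save(out);
        return out;
    }
}
```

        Keeping the shared handling in the base class means a subclass cannot forget to read or write the common settings; it only has to deal with what it adds.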