Class RegexLinkExtractor

  • All Implemented Interfaces:
    ILinkExtractor, IXMLConfigurable

    public class RegexLinkExtractor
    extends AbstractTextLinkExtractor

    Link extractor that uses regular expressions to extract links found in text documents. Relative links are resolved against the document URL. For HTML documents, prefer HtmlLinkExtractor or DOMLinkExtractor, which address many cases specific to HTML.
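
    To illustrate what resolving a relative link against the document URL means, here is a minimal sketch using only the JDK's java.net.URI. The class and method names are hypothetical and this is not the extractor's own code, which may resolve links differently.

```java
import java.net.URI;

// Hypothetical helper, not part of the Norconex API.
class RelativeLinkSketch {

    // Resolves a possibly relative link against the URL of the
    // document in which it was found. Absolute links are returned as-is.
    static String resolve(String documentUrl, String link) {
        return URI.create(documentUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String doc = "http://www.example.com/docs/readme.txt";
        System.out.println(resolve(doc, "notes.txt"));
        System.out.println(resolve(doc, "/index.txt"));
        System.out.println(resolve(doc, "http://other.example.com/a.txt"));
    }
}
```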

    Applicable documents

    By default, this extractor extracts URLs only from documents whose content type matches this regular expression:

     text/.*
     

    You can specify your own restrictions using AbstractLinkExtractor.setRestrictions(List), but make sure they represent text files.
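
    The default restriction can be pictured with plain java.util.regex. This is a sketch only: the class and method names are hypothetical, and it assumes the expression must match the whole content type, which the text above does not state.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Norconex API.
class ContentTypeSketch {

    // The default expression quoted in the documentation.
    static final Pattern DEFAULT_TYPES = Pattern.compile("text/.*");

    // Assumption: the whole content type must match the expression.
    static boolean isApplicable(String contentType) {
        return contentType != null
                && DEFAULT_TYPES.matcher(contentType).matches();
    }

    public static void main(String[] args) {
        System.out.println(isApplicable("text/plain"));
        System.out.println(isApplicable("application/pdf"));
    }
}
```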

    Referrer data

    The following referrer information is stored as metadata in each document represented by the extracted URLs:

    Character encoding

    By default, this extractor attempts to detect the encoding of a page when extracting links and referrer information. If no charset can be detected, it falls back to UTF-8. You can also force a specific encoding with setCharset(String).
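
    The order of precedence described above (explicit charset, then detected charset, then UTF-8) can be sketched as follows. The class, the method, and the idea of passing the detection result as a string are assumptions made for illustration. The extractor's actual detection code is not shown here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Norconex API.
class CharsetFallbackSketch {

    // configured: value given to setCharset(String), or null.
    // detected:   result of charset detection, or null if none was found.
    static Charset choose(String configured, String detected) {
        if (configured != null && !configured.isEmpty()) {
            return Charset.forName(configured);
        }
        if (detected != null && !detected.isEmpty()) {
            return Charset.forName(detected);
        }
        // Nothing configured and nothing detected: fall back to UTF-8.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(choose(null, null));
        System.out.println(choose("ISO-8859-1", "UTF-16"));
    }
}
```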

    XML configuration usage:

    
    <extractor
        class="com.norconex.collector.http.link.impl.RegexLinkExtractor"
        maxURLLength="(maximum URL length. Default is 2048)"
        charset="(supported character encoding)">
      <fieldMatcher>
        (optional expression for fields used for links extraction instead
         of the document stream)
      </fieldMatcher>
      <!-- Patterns for URLs to extract -->
      <linkExtractionPatterns>
        <pattern>
          <match>(regular expression)</match>
          <replace>(optional regex replacement)</replace>
        </pattern>
        <!-- you can have multiple pattern entries -->
      </linkExtractionPatterns>
    </extractor>

    XML usage example:

    
    <extractor
        class="com.norconex.collector.http.link.impl.RegexLinkExtractor">
      <linkExtractionPatterns>
        <pattern>
          <match>\[(\d+)\]</match>
          <replace>http://www.example.com/page?id=$1</replace>
        </pattern>
      </linkExtractionPatterns>
    </extractor>

    The above example extracts page "ids" contained in square brackets and inserts them into a custom URL.
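
    The effect of the match and replace pair in this example can be reproduced with plain java.util.regex. The sketch below is an approximation under stated assumptions: the class and method are hypothetical, and applying the replacement to each match individually is a guess at the behavior, not the extractor's confirmed implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Norconex API.
class RegexLinkSketch {

    // Finds every match of 'match' in 'text'. When 'replace' is non-null,
    // the replacement is applied to the matched text to build the URL;
    // otherwise the matched text itself is used as the URL.
    static List<String> extract(String text, String match, String replace) {
        Pattern pattern = Pattern.compile(match);
        List<String> urls = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            String found = m.group();
            urls.add(replace == null
                    ? found
                    : pattern.matcher(found).replaceFirst(replace));
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> urls = extract(
                "See pages [12] and [345] for details.",
                "\\[(\\d+)\\]",
                "http://www.example.com/page?id=$1");
        urls.forEach(System.out::println);
    }
}
```

    With the sample text, the two bracketed ids become http://www.example.com/page?id=12 and http://www.example.com/page?id=345.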

    Since:
    2.7.0
    Author:
    Pascal Essiembre
    • Constructor Detail

      • RegexLinkExtractor

        public RegexLinkExtractor()
    • Method Detail

      • getMaxURLLength

        public int getMaxURLLength()
        Gets the maximum supported URL length.
        Returns:
        maximum URL length
      • setMaxURLLength

        public void setMaxURLLength(int maxURLLength)
        Sets the maximum supported URL length.
        Parameters:
        maxURLLength - maximum URL length
      • getCharset

        public String getCharset()
        Gets the character set of pages on which link extraction is performed. Default is null (charset detection will be attempted).
        Returns:
        character set to use, or null
      • setCharset

        public void setCharset(String charset)
        Sets the character set of pages on which link extraction is performed. When none is specified (null), charset detection is attempted.
        Parameters:
        charset - character set to use, or null
      • getPatternReplacement

        public String getPatternReplacement(String pattern)
        Gets a pattern replacement.
        Parameters:
        pattern - the pattern whose replacement is to be obtained
        Returns:
        pattern replacement or null (no replacement)
        Since:
        2.8.0
      • clearPatterns

        public void clearPatterns()
      • addPattern

        public void addPattern(String pattern)
      • addPattern

        public void addPattern(String pattern,
                               String replacement)
        Adds a URL pattern, with an optional replacement.
        Parameters:
        pattern - a regular expression
        replacement - a regular expression replacement
        Since:
        2.8.0