Class RegexLinkExtractor

All Implemented Interfaces:
ILinkExtractor, IXMLConfigurable

public class RegexLinkExtractor extends AbstractTextLinkExtractor

Link extractor using regular expressions to extract links found in text documents. Relative links are resolved against the document URL. For HTML documents, prefer HtmlLinkExtractor or DOMLinkExtractor, which address many cases specific to HTML.
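Relative link resolution can be illustrated with the standard java.net.URI class. This is a minimal sketch only: the class name RelativeLinkSketch and the sample URLs are hypothetical, and the extractor's own resolution logic may differ in its details.

```java
import java.net.URI;

public class RelativeLinkSketch {

    // Resolves a link found in a document against that document's URL.
    // Absolute links are returned unchanged by URI.resolve.
    static String resolve(String documentUrl, String link) {
        return URI.create(documentUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        // Hypothetical document URL, for illustration only.
        String docUrl = "http://www.example.com/docs/index.txt";
        System.out.println(resolve(docUrl, "page2.txt"));
        System.out.println(resolve(docUrl, "/about.txt"));
        System.out.println(resolve(docUrl, "http://other.example.com/x.txt"));
    }
}
```

A link relative to the document ("page2.txt") ends up under the document's directory, a root-relative link ("/about.txt") under the host, and an absolute link is kept as is.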

Applicable documents

By default, this extractor extracts URLs only from documents whose content type matches this regular expression:

 text/.*
 

You can specify your own restrictions using AbstractLinkExtractor.setRestrictions(List), but make sure they represent text files.
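To see which content types the default expression accepts, it can be tried with java.util.regex directly. This sketch assumes a full, case-sensitive match against the content type; the class and method names are hypothetical, and how the extractor actually applies its restrictions is not shown here.

```java
import java.util.regex.Pattern;

public class ContentTypeSketch {

    // The default expression quoted above.
    static final Pattern DEFAULT_TYPES = Pattern.compile("text/.*");

    // Assumption: the whole content type must match the expression.
    static boolean isApplicable(String contentType) {
        return contentType != null
                && DEFAULT_TYPES.matcher(contentType).matches();
    }

    public static void main(String[] args) {
        System.out.println(isApplicable("text/plain"));      // true
        System.out.println(isApplicable("text/csv"));        // true
        System.out.println(isApplicable("application/pdf")); // false
    }
}
```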

Referrer data

The following referrer information is stored as metadata in each document represented by the extracted URLs:

Character encoding

By default, this extractor attempts to detect the encoding of a page when extracting links and referrer information. If no charset can be detected, it falls back to UTF-8. You can also force a specific encoding with setCharset(String).
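The order of precedence described above (explicit charset, then detected charset, then UTF-8) can be sketched as follows. The class CharsetFallbackSketch and its parameters are hypothetical; detection itself is out of scope and is represented here by a plain string that may be null.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFallbackSketch {

    // configured: value a caller would pass to setCharset(String), or null.
    // detected:   result of some charset detection step, or null if none.
    static Charset resolve(String configured, String detected) {
        if (configured != null && !configured.trim().isEmpty()) {
            return Charset.forName(configured.trim());
        }
        if (detected != null && !detected.trim().isEmpty()) {
            return Charset.forName(detected.trim());
        }
        // Nothing configured and nothing detected: fall back to UTF-8.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(resolve("ISO-8859-1", "UTF-16")); // ISO-8859-1
        System.out.println(resolve(null, "UTF-16"));         // UTF-16
        System.out.println(resolve(null, null));             // UTF-8
    }
}
```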

XML configuration usage:


<extractor
    class="com.norconex.collector.http.link.impl.RegexLinkExtractor"
    maxURLLength="(maximum URL length. Default is 2048)"
    charset="(supported character encoding)">
  <fieldMatcher>
    (optional expression for fields used for links extraction instead
     of the document stream)
  </fieldMatcher>
  <!-- Patterns for URLs to extract -->
  <linkExtractionPatterns>
    <pattern>
      <match>(regular expression)</match>
      <replace>(optional regex replacement)</replace>
    </pattern>
    <!-- you can have multiple pattern entries -->
  </linkExtractionPatterns>
</extractor>

XML usage example:


<extractor
    class="com.norconex.collector.http.link.impl.RegexLinkExtractor">
  <linkExtractionPatterns>
    <pattern>
      <match>\[(\d+)\]</match>
      <replace>http://www.example.com/page?id=$1</replace>
    </pattern>
  </linkExtractionPatterns>
</extractor>

The above example extracts page "ids" contained in square brackets and inserts each one into a custom URL.
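The effect of the match and replace values in this example can be approximated with plain java.util.regex. This is a sketch of the idea only, not the extractor's implementation: the class BracketIdSketch, its extract method, and the sample text are hypothetical, and the replacement is applied to each matched fragment on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracketIdSketch {

    // Finds every occurrence of the match expression in the text and,
    // when a replacement is given, rewrites the matched fragment with it.
    // A null replacement keeps the matched text as the extracted value.
    static List<String> extract(String text, String match, String replace) {
        List<String> urls = new ArrayList<>();
        Matcher m = Pattern.compile(match).matcher(text);
        while (m.find()) {
            String matched = m.group();
            urls.add(replace == null
                    ? matched
                    : matched.replaceFirst(match, replace));
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> urls = extract(
                "See [12] and [345] for details.",
                "\\[(\\d+)\\]",
                "http://www.example.com/page?id=$1");
        // Prints the two URLs built from ids 12 and 345.
        urls.forEach(System.out::println);
    }
}
```

For the sample text, the sketch yields http://www.example.com/page?id=12 and http://www.example.com/page?id=345.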

Since:
2.7.0
Author:
Pascal Essiembre
  • Field Details

  • Constructor Details

    • RegexLinkExtractor

      public RegexLinkExtractor()
  • Method Details

    • extractTextLinks

      public void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) throws IOException
      Specified by:
      extractTextLinks in class AbstractTextLinkExtractor
      Throws:
      IOException
    • getMaxURLLength

      public int getMaxURLLength()
      Gets the maximum supported URL length.
      Returns:
      maximum URL length
    • setMaxURLLength

      public void setMaxURLLength(int maxURLLength)
      Sets the maximum supported URL length.
      Parameters:
      maxURLLength - maximum URL length
    • getCharset

      public String getCharset()
      Gets the character set of pages on which link extraction is performed. Default is null (charset detection will be attempted).
      Returns:
      character set to use, or null
    • setCharset

      public void setCharset(String charset)
      Sets the character set of pages on which link extraction is performed. When none is specified (null), charset detection is attempted.
      Parameters:
      charset - character set to use, or null
    • getPatterns

      public List<String> getPatterns()
    • getPatternReplacement

      public String getPatternReplacement(String pattern)
      Gets a pattern replacement.
      Parameters:
      pattern - the pattern whose replacement is to be obtained
      Returns:
      pattern replacement or null (no replacement)
      Since:
      2.8.0
    • clearPatterns

      public void clearPatterns()
    • addPattern

      public void addPattern(String pattern)
    • addPattern

      public void addPattern(String pattern, String replacement)
      Adds a URL pattern, with an optional replacement.
      Parameters:
      pattern - a regular expression
      replacement - a regular expression replacement
      Since:
      2.8.0
    • loadTextLinkExtractorFromXML

      protected void loadTextLinkExtractorFromXML(XML xml)
      Description copied from class: AbstractTextLinkExtractor
      Loads configuration settings specific to the implementing class.
      Specified by:
      loadTextLinkExtractorFromXML in class AbstractTextLinkExtractor
      Parameters:
      xml - XML configuration
    • saveTextLinkExtractorToXML

      protected void saveTextLinkExtractorToXML(XML xml)
      Description copied from class: AbstractTextLinkExtractor
      Saves configuration settings specific to the implementing class.
      Specified by:
      saveTextLinkExtractorToXML in class AbstractTextLinkExtractor
      Parameters:
      xml - the XML
    • equals

      public boolean equals(Object other)
      Overrides:
      equals in class AbstractTextLinkExtractor
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class AbstractTextLinkExtractor
    • toString

      public String toString()
      Overrides:
      toString in class AbstractTextLinkExtractor