java.lang.Object
- com.norconex.collector.http.link.AbstractLinkExtractor
- - com.norconex.collector.http.link.AbstractTextLinkExtractor
  - - com.norconex.collector.http.link.impl.RegexLinkExtractor

All Implemented Interfaces:

ILinkExtractor, IXMLConfigurable
```
public class RegexLinkExtractor
extends AbstractTextLinkExtractor
```
Link extractor using regular expressions to extract links found in text documents. Relative links are resolved against the document URL. For HTML documents, prefer HtmlLinkExtractor or DOMLinkExtractor, which address many cases specific to HTML.
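To illustrate what resolving a relative link against the document URL means, here is a minimal sketch using only the standard java.net.URI class. It is not the extractor's own resolution code, which may differ in edge cases; the class name RelativeLinkDemo and the sample URLs are hypothetical.

```java
import java.net.URI;

final class RelativeLinkDemo {

    // Resolves a link found in a document against that document's URL.
    // Absolute links are returned unchanged by URI.resolve.
    static String resolve(String documentUrl, String link) {
        return URI.create(documentUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String doc = "http://www.example.com/docs/page.txt";
        System.out.println(resolve(doc, "other.txt"));
        System.out.println(resolve(doc, "../img/a.png"));
        System.out.println(resolve(doc, "http://other.example.org/x"));
    }
}
```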

Applicable documents

By default, this extractor extracts URLs only from documents whose content type matches this regular expression:
```
text/.*
```
You can specify your own restrictions using AbstractLinkExtractor.setRestrictions(List), but make sure they represent text files.
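As a rough illustration of which content types the default expression accepts, the sketch below applies the same pattern with plain java.util.regex. It does not use the extractor or its restriction API. The class name ContentTypeCheck is hypothetical, and treating the expression as a full match of the content type string is an assumption.

```java
import java.util.regex.Pattern;

final class ContentTypeCheck {

    // Same expression as the documented default.
    // Full-match semantics are assumed here, not confirmed by the source.
    static final Pattern TEXT_TYPES = Pattern.compile("text/.*");

    static boolean isApplicable(String contentType) {
        return contentType != null
                && TEXT_TYPES.matcher(contentType).matches();
    }

    public static void main(String[] args) {
        System.out.println(isApplicable("text/plain"));
        System.out.println(isApplicable("text/csv"));
        System.out.println(isApplicable("application/pdf"));
    }
}
```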

Referrer data

The following referrer information is stored as metadata in each document represented by the extracted URLs:
- Referrer reference: The reference (URL) of the page where the link to a document was found. Metadata value is HttpDocMetadata.REFERRER_REFERENCE.

Character encoding

By default, this extractor attempts to detect the encoding of a page when extracting links and referrer information. If no charset can be detected, it falls back to UTF-8. You can also dictate which encoding to use with setCharset(String).
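The precedence described above can be sketched as follows. This is only an illustration of the fallback order (explicit charset, then detected charset, then UTF-8), not the library's detection code. The class CharsetFallback and its resolve method are hypothetical, and the detected value is simply passed in as a string.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetFallback {

    // explicit: value a user would pass to setCharset(String), or null
    // detected: result of some charset detection step, or null if none
    static Charset resolve(String explicit, String detected) {
        if (explicit != null && !explicit.trim().isEmpty()) {
            return Charset.forName(explicit.trim());
        }
        if (detected != null && !detected.trim().isEmpty()) {
            return Charset.forName(detected.trim());
        }
        // Nothing configured and nothing detected: fall back to UTF-8.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, null));
        System.out.println(resolve(null, "ISO-8859-1"));
        System.out.println(resolve("UTF-16", "ISO-8859-1"));
    }
}
```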

XML configuration usage:
```
<extractor
    class="com.norconex.collector.http.link.impl.RegexLinkExtractor"
    maxURLLength="(maximum URL length. Default is 2048)"
    charset="(supported character encoding)">
  <fieldMatcher>
    (optional expression for fields used for links extraction instead
     of the document stream)
  </fieldMatcher>
  
  <linkExtractionPatterns>
    <pattern>
      <match>(regular expression)</match>
      <replace>(optional regex replacement)</replace>
    </pattern>
    
  </linkExtractionPatterns>
</extractor>
```
XML usage example:
```
<extractor
    class="com.norconex.collector.http.link.impl.RegexLinkExtractor">
  <linkExtractionPatterns>
    <pattern>
      <match>\[(\d+)\]</match>
      <replace>http://www.example.com/page?id=$1</replace>
    </pattern>
  </linkExtractionPatterns>
</extractor>
```
The above example extracts page "ids" contained in square brackets and adds them to a custom URL.
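To see what the match and replace expressions of this example produce, the sketch below runs them with plain java.util.regex on a sample string. It assumes the replacement is applied to each matched fragment using standard regex substitution, which may not mirror the extractor's internals exactly. The class name BracketIdLinks and the sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BracketIdLinks {

    // Same expressions as the <match> and <replace> elements above.
    static final Pattern MATCH = Pattern.compile("\\[(\\d+)\\]");
    static final String REPLACE = "http://www.example.com/page?id=$1";

    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = MATCH.matcher(text);
        while (m.find()) {
            // Substitute within the matched fragment only, so that
            // surrounding text never ends up in the resulting URL.
            urls.add(MATCH.matcher(m.group()).replaceFirst(REPLACE));
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : extract("See [12] and [345], not [abc].")) {
            System.out.println(url);
        }
    }
}
```

Only bracketed digit sequences match, so a fragment such as `[abc]` yields no URL.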
Since:

2.7.0

Author:

Pascal Essiembre

Field Summary

Fields
Modifier and Type	Field	Description
`static String`	`DEFAULT_CONTENT_TYPE_PATTERN`
`static int`	`DEFAULT_MAX_URL_LENGTH`	Default maximum length a URL can have.
`static int`	`MAX_BUFFER_SIZE`
`static int`	`OVERLAP_SIZE`

Constructor Summary

Constructors
Constructor	Description
`RegexLinkExtractor()`

Method Summary

Modifier and Type	Method	Description
`void`	`addPattern(String pattern)`
`void`	`addPattern(String pattern, String replacement)`	Adds a URL pattern, with an optional replacement.
`void`	`clearPatterns()`
`boolean`	`equals(Object other)`
`void`	`extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader)`
`String`	`getCharset()`	Gets the character set of pages on which link extraction is performed.
`int`	`getMaxURLLength()`	Gets the maximum supported URL length.
`String`	`getPatternReplacement(String pattern)`	Gets a pattern replacement.
`List<String>`	`getPatterns()`
`int`	`hashCode()`
`protected void`	`loadTextLinkExtractorFromXML(XML xml)`	Loads configuration settings specific to the implementing class.
`protected void`	`saveTextLinkExtractorToXML(XML xml)`	Saves configuration settings specific to the implementing class.
`void`	`setCharset(String charset)`	Sets the character set of pages on which link extraction is performed.
`void`	`setMaxURLLength(int maxURLLength)`	Sets the maximum supported URL length.
`String`	`toString()`

Methods inherited from class com.norconex.collector.http.link.AbstractTextLinkExtractor
extractLinks, getFieldMatcher, loadLinkExtractorFromXML, saveLinkExtractorToXML, setFieldMatcher

Methods inherited from class com.norconex.collector.http.link.AbstractLinkExtractor
addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Field Detail
  - DEFAULT_CONTENT_TYPE_PATTERN
```
public static final String DEFAULT_CONTENT_TYPE_PATTERN
```
    See Also:
    
    Constant Field Values
  - MAX_BUFFER_SIZE
```
public static final int MAX_BUFFER_SIZE
```
    See Also:
    
    Constant Field Values
  - OVERLAP_SIZE
```
public static final int OVERLAP_SIZE
```
    See Also:
    
    Constant Field Values
  - DEFAULT_MAX_URL_LENGTH
```
public static final int DEFAULT_MAX_URL_LENGTH
```
    Default maximum length a URL can have.
    
    See Also:
    
    Constant Field Values
- Constructor Detail
  - RegexLinkExtractor
```
public RegexLinkExtractor()
```
- Method Detail
  - extractTextLinks
```
public void extractTextLinks(Set<Link> links,
                             HandlerDoc doc,
                             Reader reader)
                      throws IOException
```
    Specified by:
    
    extractTextLinks in class AbstractTextLinkExtractor
    
    Throws:
    
    IOException
  - getMaxURLLength
```
public int getMaxURLLength()
```
    Gets the maximum supported URL length.
    
    Returns:
    
    maximum URL length
  - setMaxURLLength
```
public void setMaxURLLength(int maxURLLength)
```
    Sets the maximum supported URL length.
    
    Parameters:
    
    maxURLLength - maximum URL length
  - getCharset
```
public String getCharset()
```
    Gets the character set of pages on which link extraction is performed. Default is null (charset detection will be attempted).
    
    Returns:
    
    character set to use, or null
  - setCharset
```
public void setCharset(String charset)
```
    Sets the character set of pages on which link extraction is performed. Not specifying any (null) will attempt charset detection.
    
    Parameters:
    
    charset - character set to use, or null
  - getPatterns
```
public List<String> getPatterns()
```
  - getPatternReplacement
```
public String getPatternReplacement(String pattern)
```
    Gets a pattern replacement.
    
    Parameters:
    
    pattern - the pattern for which to obtain its replacement
    
    Returns:
    
    pattern replacement or null (no replacement)
    
    Since:
    
    2.8.0
  - clearPatterns
```
public void clearPatterns()
```
  - addPattern
```
public void addPattern(String pattern)
```
  - addPattern
```
public void addPattern(String pattern,
                       String replacement)
```
    Adds a URL pattern, with an optional replacement.
    
    Parameters:
    
    pattern - a regular expression
    
    replacement - a regular expression replacement
    
    Since:
    
    2.8.0
  - loadTextLinkExtractorFromXML
```
protected void loadTextLinkExtractorFromXML(XML xml)
```
    Description copied from class: AbstractTextLinkExtractor
    
    Loads configuration settings specific to the implementing class.
    
    Specified by:
    
    loadTextLinkExtractorFromXML in class AbstractTextLinkExtractor
    
    Parameters:
    
    xml - XML configuration
  - saveTextLinkExtractorToXML
```
protected void saveTextLinkExtractorToXML(XML xml)
```
    Description copied from class: AbstractTextLinkExtractor
    
    Saves configuration settings specific to the implementing class.
    
    Specified by:
    
    saveTextLinkExtractorToXML in class AbstractTextLinkExtractor
    
    Parameters:
    
    xml - the XML
  - equals
```
public boolean equals(Object other)
```
    Overrides:
    
    equals in class AbstractTextLinkExtractor
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class AbstractTextLinkExtractor
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class AbstractTextLinkExtractor
