Class RegexLinkExtractor
- java.lang.Object
  - com.norconex.collector.http.link.AbstractLinkExtractor
    - com.norconex.collector.http.link.AbstractTextLinkExtractor
      - com.norconex.collector.http.link.impl.RegexLinkExtractor
- All Implemented Interfaces:
  ILinkExtractor, IXMLConfigurable
public class RegexLinkExtractor extends AbstractTextLinkExtractor
Link extractor using regular expressions to extract links found in text documents. Relative links are resolved against the document URL. For HTML documents, it is best to use HtmlLinkExtractor or DOMLinkExtractor, which address many cases specific to HTML.
Applicable documents
By default, this extractor will extract URLs only in documents having their content type matching this regular expression:
text/.*
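As a rough illustration of what this default accepts, the sketch below tests a few content types against the same expression using plain `java.util.regex`. The class name `ContentTypeSketch` and the helper `isApplicable` are hypothetical, and the use of a full match (`matches()`) is an assumption: the source does not show how the library applies the restriction internally.

```java
import java.util.regex.Pattern;

// Hypothetical illustration, not part of the Norconex API.
final class ContentTypeSketch {

    // Same expression as the documented default restriction.
    static final Pattern TEXT_TYPES = Pattern.compile("text/.*");

    // Assumption: the whole content type must match the expression.
    static boolean isApplicable(String contentType) {
        return contentType != null
                && TEXT_TYPES.matcher(contentType).matches();
    }

    public static void main(String[] args) {
        String[] types = {"text/plain", "text/csv", "application/pdf"};
        for (String type : types) {
            System.out.println(type + " -> " + isApplicable(type));
        }
    }
}
```

Under that full-match assumption, any `text/` subtype qualifies while binary types such as `application/pdf` do not.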
You can specify your own restrictions using AbstractLinkExtractor.setRestrictions(List), but make sure they represent text files.
Referrer data
The following referrer information is stored as metadata in each document represented by the extracted URLs:
- Referrer reference: The reference (URL) of the page where the link to a document was found. Metadata value is HttpDocMetadata.REFERRER_REFERENCE.
Character encoding
By default, this extractor attempts to detect the encoding of a page when extracting links and referrer information. If no charset can be detected, it falls back to UTF-8. It is also possible to dictate which encoding to use with setCharset(String).
XML configuration usage:
  <extractor class="com.norconex.collector.http.link.impl.RegexLinkExtractor"
      maxURLLength="(maximum URL length. Default is 2048)"
      charset="(supported character encoding)">
    <fieldMatcher>
      (optional expression for fields used for links extraction instead
       of the document stream)
    </fieldMatcher>
    <!-- Patterns for URLs to extract -->
    <linkExtractionPatterns>
      <pattern>
        <match>(regular expression)</match>
        <replace>(optional regex replacement)</replace>
      </pattern>
      <!-- you can have multiple pattern entries -->
    </linkExtractionPatterns>
  </extractor>
XML usage example:
  <extractor class="com.norconex.collector.http.link.impl.RegexLinkExtractor">
    <linkExtractionPatterns>
      <pattern>
        <match>\[(\d+)\]</match>
        <replace>http://www.example.com/page?id=$1</replace>
      </pattern>
    </linkExtractionPatterns>
  </extractor>
The above example extracts page "ids" contained in square brackets and adds them to a custom URL.
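To make the match/replace pair more concrete, here is a minimal sketch that applies the same expression and replacement with plain `java.util.regex`. It only mimics the documented effect. The class `RegexReplaceSketch` and the helper `extract` are hypothetical, and applying the replacement to each matched fragment is an assumption about the behavior, not the library's actual code.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of the Norconex API.
final class RegexReplaceSketch {

    // Applies one match/replace pair to a text and collects the resulting
    // URLs in order of first appearance, without duplicates.
    static Set<String> extract(String text, String match, String replace) {
        Set<String> urls = new LinkedHashSet<>();
        Pattern pattern = Pattern.compile(match);
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            // Re-apply the pattern to the matched fragment only, so that
            // "$1" in the replacement refers to the first capture group.
            urls.add(pattern.matcher(m.group()).replaceFirst(replace));
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "See items [12] and [345], then [12] again.";
        Set<String> urls = extract(text, "\\[(\\d+)\\]",
                "http://www.example.com/page?id=$1");
        urls.forEach(System.out::println);
    }
}
```

Note that the backslashes are doubled only because the expression is written as a Java string literal; in the XML configuration the pattern is written as `\[(\d+)\]`.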
- Since:
- 2.7.0
- Author:
- Pascal Essiembre
-
-
Field Summary
Fields (modifier and type, field, description):
- static String DEFAULT_CONTENT_TYPE_PATTERN
- static int DEFAULT_MAX_URL_LENGTH: Default maximum length a URL can have.
- static int MAX_BUFFER_SIZE
- static int OVERLAP_SIZE
-
Constructor Summary
Constructors:
- RegexLinkExtractor()
-
Method Summary
Methods (modifier and type, method, description):
- void addPattern(String pattern)
- void addPattern(String pattern, String replacement): Adds a URL pattern, with an optional replacement.
- void clearPatterns()
- boolean equals(Object other)
- void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader)
- String getCharset(): Gets the character set of pages on which link extraction is performed.
- int getMaxURLLength(): Gets the maximum supported URL length.
- String getPatternReplacement(String pattern): Gets a pattern replacement.
- List<String> getPatterns()
- int hashCode()
- protected void loadTextLinkExtractorFromXML(XML xml): Loads configuration settings specific to the implementing class.
- protected void saveTextLinkExtractorToXML(XML xml): Saves configuration settings specific to the implementing class.
- void setCharset(String charset): Sets the character set of pages on which link extraction is performed.
- void setMaxURLLength(int maxURLLength): Sets the maximum supported URL length.
- String toString()
-
Methods inherited from class com.norconex.collector.http.link.AbstractTextLinkExtractor
extractLinks, getFieldMatcher, loadLinkExtractorFromXML, saveLinkExtractorToXML, setFieldMatcher
-
Methods inherited from class com.norconex.collector.http.link.AbstractLinkExtractor
addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
-
-
-
-
Field Detail
-
DEFAULT_CONTENT_TYPE_PATTERN
public static final String DEFAULT_CONTENT_TYPE_PATTERN
- See Also:
- Constant Field Values
-
MAX_BUFFER_SIZE
public static final int MAX_BUFFER_SIZE
- See Also:
- Constant Field Values
-
OVERLAP_SIZE
public static final int OVERLAP_SIZE
- See Also:
- Constant Field Values
-
DEFAULT_MAX_URL_LENGTH
public static final int DEFAULT_MAX_URL_LENGTH
Default maximum length a URL can have.
- See Also:
- Constant Field Values
-
-
Method Detail
-
extractTextLinks
public void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) throws IOException
- Specified by: extractTextLinks in class AbstractTextLinkExtractor
- Throws: IOException
-
getMaxURLLength
public int getMaxURLLength()
Gets the maximum supported URL length.
- Returns: maximum URL length
-
setMaxURLLength
public void setMaxURLLength(int maxURLLength)
Sets the maximum supported URL length.
- Parameters: maxURLLength - maximum URL length
-
getCharset
public String getCharset()
Gets the character set of pages on which link extraction is performed. Default is null (charset detection will be attempted).
- Returns: character set to use, or null
-
setCharset
public void setCharset(String charset)
Sets the character set of pages on which link extraction is performed. If none is specified (null), charset detection will be attempted.
- Parameters: charset - character set to use, or null
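The charset rules described under "Character encoding" (an explicit charset wins, otherwise detection is attempted, otherwise UTF-8) can be sketched as a simple precedence chain. This is only an illustration under stated assumptions: `CharsetFallbackSketch`, `resolve`, and the `detected` argument are hypothetical names, and the actual detection step is not reproduced here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of the Norconex API.
final class CharsetFallbackSketch {

    // configured: value one would pass to setCharset(String), may be null.
    // detected:   stand-in for the result of charset detection, may be null.
    static Charset resolve(String configured, String detected) {
        if (configured != null && !configured.trim().isEmpty()) {
            return Charset.forName(configured.trim());
        }
        if (detected != null && !detected.trim().isEmpty()) {
            return Charset.forName(detected.trim());
        }
        // Documented fallback when nothing could be detected.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(resolve("ISO-8859-1", "UTF-16"));
        System.out.println(resolve(null, "US-ASCII"));
        System.out.println(resolve(null, null));
    }
}
```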
-
getPatternReplacement
public String getPatternReplacement(String pattern)
Gets a pattern replacement.
- Parameters: pattern - the pattern for which to obtain the replacement
- Returns: pattern replacement or null (no replacement)
- Since: 2.8.0
-
clearPatterns
public void clearPatterns()
-
addPattern
public void addPattern(String pattern)
-
addPattern
public void addPattern(String pattern, String replacement)
Adds a URL pattern, with an optional replacement.
- Parameters:
  pattern - a regular expression
  replacement - a regular expression replacement
- Since: 2.8.0
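The contract of the pattern methods (a pattern may be added with or without a replacement, and getPatternReplacement returns null when there is none) can be modeled with a small stand-in class. The sketch below is not the RegexLinkExtractor implementation: `PatternRegistrySketch` is a hypothetical name, and keeping insertion order and overwriting a pattern that is added twice are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in that mirrors the documented method signatures only.
final class PatternRegistrySketch {

    // Insertion-ordered map; a null value means "no replacement".
    private final Map<String, String> patterns = new LinkedHashMap<>();

    void addPattern(String pattern) {
        addPattern(pattern, null);
    }

    void addPattern(String pattern, String replacement) {
        patterns.put(pattern, replacement);
    }

    String getPatternReplacement(String pattern) {
        return patterns.get(pattern);
    }

    List<String> getPatterns() {
        return new ArrayList<>(patterns.keySet());
    }

    void clearPatterns() {
        patterns.clear();
    }

    public static void main(String[] args) {
        PatternRegistrySketch registry = new PatternRegistrySketch();
        registry.addPattern("https?://\\S+");
        registry.addPattern("\\[(\\d+)\\]", "http://www.example.com/page?id=$1");
        for (String p : registry.getPatterns()) {
            System.out.println(p + " => " + registry.getPatternReplacement(p));
        }
    }
}
```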
-
loadTextLinkExtractorFromXML
protected void loadTextLinkExtractorFromXML(XML xml)
Description copied from class: AbstractTextLinkExtractor
Loads configuration settings specific to the implementing class.
- Specified by: loadTextLinkExtractorFromXML in class AbstractTextLinkExtractor
- Parameters: xml - XML configuration
-
saveTextLinkExtractorToXML
protected void saveTextLinkExtractorToXML(XML xml)
Description copied from class: AbstractTextLinkExtractor
Saves configuration settings specific to the implementing class.
- Specified by: saveTextLinkExtractorToXML in class AbstractTextLinkExtractor
- Parameters: xml - the XML
-
equals
public boolean equals(Object other)
- Overrides: equals in class AbstractTextLinkExtractor
-
hashCode
public int hashCode()
- Overrides: hashCode in class AbstractTextLinkExtractor
-
toString
public String toString()
- Overrides: toString in class AbstractTextLinkExtractor
-
-