Class RegexLinkExtractor
java.lang.Object
  com.norconex.collector.http.link.AbstractLinkExtractor
    com.norconex.collector.http.link.AbstractTextLinkExtractor
      com.norconex.collector.http.link.impl.RegexLinkExtractor

All Implemented Interfaces:
ILinkExtractor, IXMLConfigurable
public class RegexLinkExtractor extends AbstractTextLinkExtractor
Link extractor using regular expressions to extract links found in text documents. Relative links are resolved against the document URL. For HTML documents, it is best to use HtmlLinkExtractor or DOMLinkExtractor, which address many cases specific to HTML.

Applicable documents
By default, this extractor will extract URLs only in documents having their content type matching this regular expression:
text/.*
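As a rough illustration of the default restriction, the check can be sketched with plain java.util.regex. The class and method names below are hypothetical, and the sketch assumes the expression has to match the whole content type; it does not reproduce the extractor's actual restriction logic.

```java
import java.util.regex.Pattern;

public class ContentTypeCheckSketch {

    // Same expression as the documented default.
    static final Pattern DEFAULT_TYPES = Pattern.compile("text/.*");

    // Hypothetical helper: true when the whole content type matches
    // the default expression (whole-string matching is an assumption).
    public static boolean isApplicable(String contentType) {
        return contentType != null
                && DEFAULT_TYPES.matcher(contentType).matches();
    }

    public static void main(String[] args) {
        System.out.println(isApplicable("text/plain"));      // true
        System.out.println(isApplicable("text/html"));       // true
        System.out.println(isApplicable("application/pdf")); // false
    }
}
```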
You can specify your own restrictions using AbstractLinkExtractor.setRestrictions(List), but make sure they represent text files.

Referrer data
The following referrer information is stored as metadata in each document represented by the extracted URLs:
- Referrer reference: The reference (URL) of the page where the link to a document was found. Metadata value is HttpDocMetadata.REFERRER_REFERENCE.

Character encoding
By default, this extractor attempts to detect the encoding of a page when extracting links and referrer information. If no charset can be detected, it falls back to UTF-8. It is also possible to dictate which encoding to use with setCharset(String).

XML configuration usage:
<extractor class="com.norconex.collector.http.link.impl.RegexLinkExtractor"
    maxURLLength="(maximum URL length. Default is 2048)"
    charset="(supported character encoding)">

  <fieldMatcher>
    (optional expression for fields used for links extraction instead
     of the document stream)
  </fieldMatcher>

  <!-- Patterns for URLs to extract -->
  <linkExtractionPatterns>
    <pattern>
      <match>(regular expression)</match>
      <replace>(optional regex replacement)</replace>
    </pattern>
    <!-- you can have multiple pattern entries -->
  </linkExtractionPatterns>
</extractor>

XML usage example:
<extractor class="com.norconex.collector.http.link.impl.RegexLinkExtractor">
  <linkExtractionPatterns>
    <pattern>
      <match>\[(\d+)\]</match>
      <replace>http://www.example.com/page?id=$1</replace>
    </pattern>
  </linkExtractionPatterns>
</extractor>

The above example extracts page "ids" contained in square brackets and adds them to a custom URL.
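
To make the effect of a match/replace pair more concrete, here is a minimal sketch using only java.util.regex. The class RegexLinkSketch and its extract method are hypothetical and are not part of the Norconex API; the sketch assumes the replacement is applied to each matched fragment on its own, which may differ from how the extractor is implemented internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLinkSketch {

    // Hypothetical helper: finds every match of "match" in the text and,
    // when a replacement is given, rewrites the matched fragment with it.
    public static List<String> extract(
            String text, String match, String replace) {
        Pattern pattern = Pattern.compile(match);
        List<String> urls = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            String url = m.group();
            if (replace != null) {
                // Re-apply the pattern to the matched fragment only,
                // so group references such as $1 are expanded.
                url = pattern.matcher(url).replaceFirst(replace);
            }
            urls.add(url);
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "See pages [12] and [345] for details.";
        List<String> urls = extract(text,
                "\\[(\\d+)\\]", "http://www.example.com/page?id=$1");
        urls.forEach(System.out::println);
    }
}
```

With the sample text above, the sketch prints http://www.example.com/page?id=12 followed by http://www.example.com/page?id=345. Passing null as the replacement returns the matched fragments unchanged.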
- Since: 2.7.0
- Author: Pascal Essiembre

Field Summary
Modifier and Type   Field                          Description
static String       DEFAULT_CONTENT_TYPE_PATTERN
static int          DEFAULT_MAX_URL_LENGTH         Default maximum length a URL can have.
static int          MAX_BUFFER_SIZE
static int          OVERLAP_SIZE
-
Constructor Summary
Constructor            Description
RegexLinkExtractor()
-
Method Summary
Modifier and Type   Method                                                             Description
void                addPattern(String pattern)
void                addPattern(String pattern, String replacement)                     Adds a URL pattern, with an optional replacement.
void                clearPatterns()
boolean             equals(Object other)
void                extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader)
String              getCharset()                                                       Gets the character set of pages on which link extraction is performed.
int                 getMaxURLLength()                                                  Gets the maximum supported URL length.
String              getPatternReplacement(String pattern)                              Gets a pattern replacement.
List<String>        getPatterns()
int                 hashCode()
protected void      loadTextLinkExtractorFromXML(XML xml)                              Loads configuration settings specific to the implementing class.
protected void      saveTextLinkExtractorToXML(XML xml)                                Saves configuration settings specific to the implementing class.
void                setCharset(String charset)                                         Sets the character set of pages on which link extraction is performed.
void                setMaxURLLength(int maxURLLength)                                  Sets the maximum supported URL length.
String              toString()

Methods inherited from class com.norconex.collector.http.link.AbstractTextLinkExtractor
extractLinks, getFieldMatcher, loadLinkExtractorFromXML, saveLinkExtractorToXML, setFieldMatcher
-
Methods inherited from class com.norconex.collector.http.link.AbstractLinkExtractor
addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
Field Detail
-
DEFAULT_CONTENT_TYPE_PATTERN
public static final String DEFAULT_CONTENT_TYPE_PATTERN
- See Also:
- Constant Field Values
-
MAX_BUFFER_SIZE
public static final int MAX_BUFFER_SIZE
- See Also:
- Constant Field Values
-
OVERLAP_SIZE
public static final int OVERLAP_SIZE
- See Also:
- Constant Field Values
-
DEFAULT_MAX_URL_LENGTH
public static final int DEFAULT_MAX_URL_LENGTH
Default maximum length a URL can have.
- See Also:
- Constant Field Values
-
-
Method Detail
-
extractTextLinks
public void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) throws IOException
- Specified by:
extractTextLinks in class AbstractTextLinkExtractor
- Throws:
IOException
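
The class description notes that relative links are resolved to the document URL. A minimal sketch of that kind of resolution is shown below, using java.net.URI from the standard library rather than the extractor's own URL handling. The class name, method name, and sample URLs are hypothetical.

```java
import java.net.URI;

public class RelativeLinkSketch {

    // Hypothetical helper: resolves an extracted link against the URL
    // of the document it was found in. Absolute links are returned as-is.
    public static String resolve(String documentUrl, String link) {
        return URI.create(documentUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String doc = "http://www.example.com/docs/readme.txt";
        System.out.println(resolve(doc, "notes.txt"));
        System.out.println(resolve(doc, "../img/logo.png"));
        System.out.println(resolve(doc, "/about"));
        System.out.println(resolve(doc, "https://other.example.org/a"));
    }
}
```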
-
getMaxURLLength
public int getMaxURLLength()
Gets the maximum supported URL length.
- Returns:
- maximum URL length
-
setMaxURLLength
public void setMaxURLLength(int maxURLLength)
Sets the maximum supported URL length.
- Parameters:
maxURLLength - maximum URL length
-
getCharset
public String getCharset()
Gets the character set of pages on which link extraction is performed. Default is null (charset detection will be attempted).
- Returns:
- character set to use, or null
-
setCharset
public void setCharset(String charset)
Sets the character set of pages on which link extraction is performed. If none is specified (null), charset detection is attempted.
- Parameters:
charset - character set to use, or null
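
The documented precedence (an explicitly set charset first, then a detected one, then UTF-8) can be sketched as follows. This is only an illustration under those stated rules: the class and method are hypothetical, and the detection step itself is represented by a plain parameter rather than any real detection call.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFallbackSketch {

    // Hypothetical helper: "configured" stands for the value given to
    // setCharset(String); "detected" stands for the result of charset
    // detection, or null when nothing could be detected.
    public static Charset resolve(String configured, String detected) {
        if (configured != null) {
            return Charset.forName(configured);
        }
        if (detected != null) {
            return Charset.forName(detected);
        }
        // Documented fallback when no charset could be detected.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(resolve("ISO-8859-1", "UTF-16"));
        System.out.println(resolve(null, "UTF-16"));
        System.out.println(resolve(null, null));
    }
}
```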
-
getPatternReplacement
public String getPatternReplacement(String pattern)
Gets a pattern replacement.
- Parameters:
pattern - the pattern for which to obtain the replacement
- Returns:
- pattern replacement, or null (no replacement)
- Since:
- 2.8.0
-
clearPatterns
public void clearPatterns()
-
addPattern
public void addPattern(String pattern)
-
addPattern
public void addPattern(String pattern, String replacement)
Adds a URL pattern, with an optional replacement.
- Parameters:
pattern - a regular expression
replacement - a regular expression replacement
- Since:
- 2.8.0
-
loadTextLinkExtractorFromXML
protected void loadTextLinkExtractorFromXML(XML xml)
Description copied from class: AbstractTextLinkExtractor
Loads configuration settings specific to the implementing class.
- Specified by:
loadTextLinkExtractorFromXML in class AbstractTextLinkExtractor
- Parameters:
xml - XML configuration
-
saveTextLinkExtractorToXML
protected void saveTextLinkExtractorToXML(XML xml)
Description copied from class: AbstractTextLinkExtractor
Saves configuration settings specific to the implementing class.
- Specified by:
saveTextLinkExtractorToXML in class AbstractTextLinkExtractor
- Parameters:
xml - the XML
-
equals
public boolean equals(Object other)
- Overrides:
equals in class AbstractTextLinkExtractor
-
hashCode
public int hashCode()
- Overrides:
hashCode in class AbstractTextLinkExtractor
-
toString
public String toString()
- Overrides:
toString in class AbstractTextLinkExtractor
-
-