Class RegexLinkExtractor
- All Implemented Interfaces:
ILinkExtractor, IXMLConfigurable
Link extractor that uses regular expressions to extract links found in text
documents. Relative links are resolved against the document URL.
For HTML documents, it is best to use
HtmlLinkExtractor or DOMLinkExtractor,
which address many cases specific to HTML.
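To illustrate what resolving a relative link against the document URL means, here is a minimal sketch using only java.net.URI. The class name RelativeLinkSketch and the example URLs are assumptions made for illustration; the extractor's own resolution logic may differ in edge cases.

```java
import java.net.URI;

// Illustrative sketch only; not part of the Norconex API.
class RelativeLinkSketch {

    // Resolves a possibly relative link against the URL of the
    // document in which the link was found.
    static String resolve(String documentUrl, String link) {
        return URI.create(documentUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String doc = "http://www.example.com/docs/notes.txt";
        // Relative to the document's directory:
        System.out.println(resolve(doc, "other.txt"));
        // Relative to the site root:
        System.out.println(resolve(doc, "/top.txt"));
        // Already absolute, returned unchanged:
        System.out.println(resolve(doc, "http://other.example.org/a"));
    }
}
```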
Applicable documents
By default, this extractor only extracts URLs from documents whose content type matches this regular expression:
text/.*
You can specify your own restrictions using AbstractLinkExtractor.setRestrictions(List),
but make sure they represent text files.
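As a rough illustration of the default restriction, the sketch below tests content types against text/.* with java.util.regex. It assumes a full, case-sensitive match; how the extractor itself applies the expression is not stated here, and the class name ContentTypeSketch is hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; not part of the Norconex API.
class ContentTypeSketch {

    // The default content type expression quoted above.
    static final Pattern TEXT_TYPES = Pattern.compile("text/.*");

    // Assumption: the whole content type must match the expression.
    static boolean isApplicable(String contentType) {
        return contentType != null
                && TEXT_TYPES.matcher(contentType).matches();
    }

    public static void main(String[] args) {
        System.out.println(isApplicable("text/plain"));      // true
        System.out.println(isApplicable("text/csv"));        // true
        System.out.println(isApplicable("application/pdf")); // false
    }
}
```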
Referrer data
The following referrer information is stored as metadata in each document represented by the extracted URLs:
- Referrer reference: The reference (URL) of the page where the
link to a document was found. Metadata value is
HttpDocMetadata.REFERRER_REFERENCE.
Character encoding
By default, this extractor attempts to
detect the encoding of a page when extracting links and
referrer information. If no charset can be detected, it falls back to
UTF-8. You can also set the encoding explicitly with
setCharset(String).
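The selection order described above (explicit charset, then detected charset, then UTF-8) can be sketched as follows. This is an assumption-laden illustration: the class CharsetFallbackSketch and its choose method are hypothetical, and the detected value stands in for whatever the extractor's detection step produces.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; not part of the Norconex API.
class CharsetFallbackSketch {

    // configured: value that would be given to setCharset(String), or null.
    // detected: name produced by a detection step (hypothetical), or null.
    static Charset choose(String configured, String detected) {
        if (configured != null && !configured.isEmpty()) {
            return Charset.forName(configured);
        }
        if (detected != null && !detected.isEmpty()
                && Charset.isSupported(detected)) {
            return Charset.forName(detected);
        }
        // Nothing configured and nothing detected: fall back to UTF-8.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(choose(null, null));           // UTF-8
        System.out.println(choose(null, "UTF-16"));       // UTF-16
        System.out.println(choose("ISO-8859-1", "UTF-16")); // ISO-8859-1
    }
}
```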
XML configuration usage:
<extractor
    class="com.norconex.collector.http.link.impl.RegexLinkExtractor"
    maxURLLength="(maximum URL length. Default is 2048)"
    charset="(supported character encoding)">
  <fieldMatcher>
    (optional expression for fields used for links extraction instead
     of the document stream)
  </fieldMatcher>
  <!-- Patterns for URLs to extract -->
  <linkExtractionPatterns>
    <pattern>
      <match>(regular expression)</match>
      <replace>(optional regex replacement)</replace>
    </pattern>
    <!-- you can have multiple pattern entries -->
  </linkExtractionPatterns>
</extractor>
XML usage example:
<extractor
    class="com.norconex.collector.http.link.impl.RegexLinkExtractor">
  <linkExtractionPatterns>
    <pattern>
      <match>\[(\d+)\]</match>
      <replace>http://www.example.com/page?id=$1</replace>
    </pattern>
  </linkExtractionPatterns>
</extractor>
The above example extracts page "ids" contained in square brackets and inserts them into a custom URL.
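To show what the match and replace expressions from the example do to a piece of text, here is a minimal sketch with java.util.regex. It only mimics the pattern logic; the class BracketIdSketch, its extract method, and the sample sentence are assumptions, and the extractor's actual matching loop may work differently.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; not part of the Norconex API.
class BracketIdSketch {

    // Finds every match of "match" in the text. When "replace" is given,
    // the matched text is rewritten with it ($1 refers to group 1);
    // otherwise the matched text itself is kept.
    static Set<String> extract(String text, String match, String replace) {
        Set<String> urls = new LinkedHashSet<>();
        Pattern pattern = Pattern.compile(match);
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            String found = m.group();
            urls.add(replace == null
                    ? found
                    : pattern.matcher(found).replaceFirst(replace));
        }
        return urls;
    }

    public static void main(String[] args) {
        Set<String> urls = extract(
                "See [12] and [345] for details.",
                "\\[(\\d+)\\]",
                "http://www.example.com/page?id=$1");
        // Prints:
        // [http://www.example.com/page?id=12, http://www.example.com/page?id=345]
        System.out.println(urls);
    }
}
```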
- Since: 2.7.0
- Author: Pascal Essiembre
Field Summary
- static final String DEFAULT_CONTENT_TYPE_PATTERN
- static final int DEFAULT_MAX_URL_LENGTH: Default maximum length a URL can have.
- static final int MAX_BUFFER_SIZE
- static final int OVERLAP_SIZE
Constructor Summary
- RegexLinkExtractor()
Method Summary
- void addPattern(String pattern)
- void addPattern(String pattern, String replacement): Adds a URL pattern, with an optional replacement.
- void clearPatterns()
- boolean equals
- void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader)
- String getCharset(): Gets the character set of pages on which link extraction is performed.
- int getMaxURLLength(): Gets the maximum supported URL length.
- String getPatternReplacement(String pattern): Gets a pattern replacement.
- getPatterns
- int hashCode()
- protected void loadTextLinkExtractorFromXML: Loads configuration settings specific to the implementing class.
- protected void saveTextLinkExtractorToXML: Saves configuration settings specific to the implementing class.
- void setCharset(String charset): Sets the character set of pages on which link extraction is performed.
- void setMaxURLLength(int maxURLLength): Sets the maximum supported URL length.
- String toString()
Methods inherited from class com.norconex.collector.http.link.AbstractTextLinkExtractor
extractLinks, getFieldMatcher, loadLinkExtractorFromXML, saveLinkExtractorToXML, setFieldMatcher
Methods inherited from class com.norconex.collector.http.link.AbstractLinkExtractor
addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
Field Details
DEFAULT_CONTENT_TYPE_PATTERN
-
MAX_BUFFER_SIZE
public static final int MAX_BUFFER_SIZE
-
OVERLAP_SIZE
public static final int OVERLAP_SIZE
-
DEFAULT_MAX_URL_LENGTH
public static final int DEFAULT_MAX_URL_LENGTH
Default maximum length a URL can have.
Constructor Details
RegexLinkExtractor
public RegexLinkExtractor()
Method Details
extractTextLinks
- Specified by: extractTextLinks in class AbstractTextLinkExtractor
- Throws: IOException
-
getMaxURLLength
public int getMaxURLLength()
Gets the maximum supported URL length.
- Returns: maximum URL length
-
setMaxURLLength
public void setMaxURLLength(int maxURLLength)
Sets the maximum supported URL length.
- Parameters: maxURLLength - maximum URL length
-
getCharset
Gets the character set of pages on which link extraction is performed. Default is null (charset detection will be attempted).
- Returns: character set to use, or null
-
setCharset
Sets the character set of pages on which link extraction is performed. If none is specified (null), charset detection is attempted.
- Parameters: charset - character set to use, or null
-
getPatterns
-
getPatternReplacement
Gets a pattern replacement.
- Parameters: pattern - the pattern for which to obtain its replacement
- Returns: pattern replacement or null (no replacement)
- Since: 2.8.0
-
clearPatterns
public void clearPatterns() -
addPattern
-
addPattern
Adds a URL pattern, with an optional replacement.
- Parameters:
pattern - a regular expression
replacement - a regular expression replacement
- Since: 2.8.0
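The pattern methods documented above (addPattern, getPatternReplacement, clearPatterns) suggest a mapping from each pattern to an optional replacement, with null meaning "no replacement". The sketch below models that contract with a hypothetical stand-in class, PatternRegistrySketch; it is not the library class, its getPatterns return type is an assumption, and its internals are guesses made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in; NOT the RegexLinkExtractor implementation.
class PatternRegistrySketch {

    // Pattern -> replacement, kept in insertion order. A null value
    // means the pattern has no replacement.
    private final Map<String, String> patterns = new LinkedHashMap<>();

    void addPattern(String pattern) {
        addPattern(pattern, null);
    }

    void addPattern(String pattern, String replacement) {
        patterns.put(pattern, replacement);
    }

    // Assumed return type for this sketch.
    List<String> getPatterns() {
        return new ArrayList<>(patterns.keySet());
    }

    // Returns the replacement, or null when there is none.
    String getPatternReplacement(String pattern) {
        return patterns.get(pattern);
    }

    void clearPatterns() {
        patterns.clear();
    }

    public static void main(String[] args) {
        PatternRegistrySketch registry = new PatternRegistrySketch();
        registry.addPattern("https?://\\S+");
        registry.addPattern("\\[(\\d+)\\]",
                "http://www.example.com/page?id=$1");
        System.out.println(registry.getPatterns().size());                  // 2
        System.out.println(registry.getPatternReplacement("https?://\\S+")); // null
        registry.clearPatterns();
        System.out.println(registry.getPatterns().isEmpty());               // true
    }
}
```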
-
loadTextLinkExtractorFromXML
Description copied from class: AbstractTextLinkExtractor
Loads configuration settings specific to the implementing class.
- Specified by: loadTextLinkExtractorFromXML in class AbstractTextLinkExtractor
- Parameters: xml - XML configuration
-
saveTextLinkExtractorToXML
Description copied from class: AbstractTextLinkExtractor
Saves configuration settings specific to the implementing class.
- Specified by: saveTextLinkExtractorToXML in class AbstractTextLinkExtractor
- Parameters: xml - the XML
-
equals
- Overrides: equals in class AbstractTextLinkExtractor
-
hashCode
public int hashCode()
- Overrides: hashCode in class AbstractTextLinkExtractor
-
toString
- Overrides: toString in class AbstractTextLinkExtractor