public class RegexLinkExtractor extends Object implements ILinkExtractor, IXMLConfigurable
Link extractor using regular expressions to extract links found in text
documents. Relative links are resolved to the document URL.
For HTML documents, it is best to use the GenericLinkExtractor, which addresses many cases specific to HTML.
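As an illustration of what resolving a relative link against the document URL can look like, here is a minimal sketch using only the JDK's java.net.URI. The class name RelativeLinkSketch and the method resolve are hypothetical; this is not the extractor's actual code, which may handle edge cases differently.

```java
import java.net.URI;

public class RelativeLinkSketch {

    // Resolves a possibly relative link against the URL of the document
    // in which it was found. Absolute links are returned unchanged.
    public static String resolve(String documentUrl, String link) {
        return URI.create(documentUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String doc = "http://www.example.com/docs/readme.txt";
        System.out.println(resolve(doc, "notes.txt"));
        System.out.println(resolve(doc, "/index.txt"));
        System.out.println(resolve(doc, "http://other.example.org/a.txt"));
    }
}
```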
By default, this extractor will extract URLs only in documents having their content type matching this regular expression:
text/.*
You can specify your own content type or reference restriction patterns using setApplyToContentTypePattern(String) or setApplyToReferencePattern(String), but make sure they represent text files. When both methods are used, a document must be matched by both to be accepted.
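The acceptance rule described above can be sketched as follows. This is an assumption-laden illustration, not the library's implementation: the class AcceptSketch is hypothetical, content types are passed as plain strings, and full-match semantics (Pattern.matches) are assumed for both patterns.

```java
import java.util.regex.Pattern;

public class AcceptSketch {

    // Default stated in the documentation above.
    static final String DEFAULT_CONTENT_TYPE_PATTERN = "text/.*";

    // A null contentTypePattern falls back to the default; a null
    // referencePattern accepts every reference. When both are set,
    // both must match (assumed here to be full matches).
    public static boolean accepts(String url, String contentType,
            String contentTypePattern, String referencePattern) {
        String typePattern = contentTypePattern != null
                ? contentTypePattern : DEFAULT_CONTENT_TYPE_PATTERN;
        boolean typeOk = Pattern.matches(typePattern, contentType);
        boolean refOk = referencePattern == null
                || Pattern.matches(referencePattern, url);
        return typeOk && refOk;
    }
}
```

For instance, with a reference pattern of `.*\.csv`, a document served as text/plain at a URL ending in .txt would be rejected in this sketch even though its content type matches.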
The following referrer information is stored as metadata in each document represented by the extracted URLs:
HttpMetadata.COLLECTOR_REFERRER_REFERENCE
By default, this extractor attempts to detect the encoding of a page when extracting links and referrer information. If no charset can be detected, it falls back to UTF-8. It is also possible to dictate which encoding to use with setCharset(String).
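The order of precedence described above (explicit charset, then detected charset, then UTF-8) might be sketched like this. The class CharsetFallbackSketch is hypothetical, and the detected parameter merely stands in for whatever a charset detector returns; the detection itself is not shown and the library's real logic may differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFallbackSketch {

    // configured: value that would be given to setCharset(String), or null.
    // detected:   name reported by a (hypothetical) detector, or null.
    public static Charset resolveCharset(String configured, String detected) {
        if (configured != null && !configured.isEmpty()) {
            // An explicitly configured charset wins; an invalid name throws.
            return Charset.forName(configured);
        }
        if (detected != null && !detected.isEmpty()) {
            try {
                return Charset.forName(detected);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported name: fall through to UTF-8.
            }
        }
        return StandardCharsets.UTF_8;
    }
}
```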
<extractor class="com.norconex.collector.http.url.impl.RegexLinkExtractor"
    maxURLLength="(maximum URL length. Default is 2048)"
    charset="(supported character encoding)" >
  <applyToContentTypePattern>
    (Regular expression matching content types this extractor
     should apply to. Default accepts "text/.*".)
  </applyToContentTypePattern>
  <applyToReferencePattern>
    (Regular expression matching references this extractor should
     apply to. Default accepts all references.)
  </applyToReferencePattern>
  <!-- Patterns for URLs to extract -->
  <linkExtractionPatterns>
    <pattern>
      <match>(regular expression)</match>
      <replace>(optional regex replacement)</replace>
    </pattern>
    <!-- you can have multiple pattern entries -->
  </linkExtractionPatterns>
</extractor>
The following extracts page "ids" contained in square brackets and adds them to a custom URL.
<extractor class="com.norconex.collector.http.url.impl.RegexLinkExtractor">
  <linkExtractionPatterns>
    <pattern>
      <match>\[(\d+)\]</match>
      <replace>http://www.example.com/page?id=$1</replace>
    </pattern>
  </linkExtractionPatterns>
</extractor>
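To make the match/replace mechanics of the example above concrete, here is a self-contained sketch built on java.util.regex only. The class BracketIdSketch and its extract method are hypothetical and only mimic the idea of the configuration; the extractor's own buffering, URL length checks, and relative-link handling are not reproduced.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracketIdSketch {

    // Finds every occurrence of `match` in `text` and turns each one
    // into a URL by applying the `replace` expression to the matched text.
    public static Set<String> extract(String text, String match, String replace) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = Pattern.compile(match).matcher(text);
        while (m.find()) {
            // Apply the replacement to the matched substring only, so
            // surrounding text never leaks into the resulting URL.
            urls.add(m.group().replaceFirst(match, replace));
        }
        return urls;
    }

    public static void main(String[] args) {
        Set<String> urls = extract(
                "See [12] and [345], also [12].",
                "\\[(\\d+)\\]",
                "http://www.example.com/page?id=$1");
        urls.forEach(System.out::println);
    }
}
```

A set is used so that the repeated "[12]" yields a single URL, which mirrors the Set<Link> return type of extractLinks.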
Modifier and Type | Field and Description |
---|---|
static String | DEFAULT_CONTENT_TYPE_PATTERN |
static int | DEFAULT_MAX_URL_LENGTH: Default maximum length a URL can have. |
static int | MAX_BUFFER_SIZE |
static int | OVERLAP_SIZE |
Constructor and Description |
---|
RegexLinkExtractor() |
Modifier and Type | Method and Description |
---|---|
boolean | accepts(String url, ContentType contentType): Whether this link extraction should be executed for the given URL and/or content type. |
void | addPattern(String pattern) |
void | addPattern(String pattern, int matchGroup): Deprecated. Since 2.8.0, use addPattern(String, String) instead. |
void | addPattern(String pattern, String replacement): Adds a URL pattern, with an optional replacement. |
void | clearPatterns() |
boolean | equals(Object other) |
Set<Link> | extractLinks(InputStream input, String reference, ContentType contentType): Extracts links from a document. |
String | getApplyToContentTypePattern() |
String | getApplyToReferencePattern() |
String | getCharset(): Gets the character set of pages on which link extraction is performed. |
int | getMaxURLLength(): Gets the maximum supported URL length. |
int | getPatternMatchGroup(String pattern): Deprecated. Since 2.8.0, use getPatternReplacement(String) instead. It will return the group id if the "replacement" value only contains a group replacement (e.g. $1), else, it will always return -1. |
String | getPatternReplacement(String pattern): Gets a pattern replacement. |
List<String> | getPatterns() |
int | hashCode() |
void | loadFromXML(Reader in) |
void | saveToXML(Writer out) |
void | setApplyToContentTypePattern(String applyToContentTypePattern) |
void | setApplyToReferencePattern(String applyToReferencePattern) |
void | setCharset(String charset): Sets the character set of pages on which link extraction is performed. |
void | setMaxURLLength(int maxURLLength): Sets the maximum supported URL length. |
String | toString() |
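The behavior documented for the deprecated getPatternMatchGroup(String), returning a group id only when the replacement is a bare group reference such as $1, can be illustrated with the sketch below. The class MatchGroupSketch and both method names are hypothetical, and the reverse mapping (group number to "$n") is an assumption about how the deprecated addPattern(String, int) relates to addPattern(String, String).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchGroupSketch {

    // A replacement consisting of nothing but a single group reference.
    private static final Pattern SINGLE_GROUP = Pattern.compile("\\$(\\d{1,9})");

    // Returns the group id for replacements like "$1", else -1.
    public static int toMatchGroup(String replacement) {
        if (replacement == null) {
            return -1;
        }
        Matcher m = SINGLE_GROUP.matcher(replacement);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    // Assumed equivalent of a legacy match group as a replacement string.
    public static String toReplacement(int matchGroup) {
        return "$" + matchGroup;
    }
}
```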
public static final String DEFAULT_CONTENT_TYPE_PATTERN
public static final int MAX_BUFFER_SIZE
public static final int OVERLAP_SIZE
public static final int DEFAULT_MAX_URL_LENGTH
public Set<Link> extractLinks(InputStream input, String reference, ContentType contentType) throws IOException
Specified by:
extractLinks in interface ILinkExtractor
Parameters:
input - the document input stream
reference - document reference (URL)
contentType - the document content type
Throws:
IOException - problem reading the document

public boolean accepts(String url, ContentType contentType)
Specified by:
accepts in interface ILinkExtractor
Parameters:
url - the url
contentType - the content type
Returns:
true if the given URL is accepted

public int getMaxURLLength()

public void setMaxURLLength(int maxURLLength)
Parameters:
maxURLLength - maximum URL length

public String getCharset()
A null value means charset detection will be attempted.
Returns:
the character set, or null

public void setCharset(String charset)
Leaving the charset unset (null) will attempt charset detection.
Parameters:
charset - character set to use, or null

public String getApplyToContentTypePattern()

public void setApplyToContentTypePattern(String applyToContentTypePattern)

public String getApplyToReferencePattern()

public void setApplyToReferencePattern(String applyToReferencePattern)

public String getPatternReplacement(String pattern)
Parameters:
pattern - the pattern for which to obtain its replacement
Returns:
the pattern replacement, or null (no replacement)

@Deprecated public int getPatternMatchGroup(String pattern)
Parameters:
pattern - the pattern for which to obtain its replacement

public void clearPatterns()

public void addPattern(String pattern)

public void addPattern(String pattern, String replacement)
Parameters:
pattern - a regular expression
replacement - a regular expression replacement

@Deprecated public void addPattern(String pattern, int matchGroup)
Deprecated. Since 2.8.0, use addPattern(String, String) instead.
Parameters:
pattern - a regular expression
matchGroup - regular expression match group

public void loadFromXML(Reader in)
Specified by:
loadFromXML in interface IXMLConfigurable

public void saveToXML(Writer out) throws IOException
Specified by:
saveToXML in interface IXMLConfigurable
Throws:
IOException
Copyright © 2009–2021 Norconex Inc. All rights reserved.