public class RegexLinkExtractor extends AbstractTextLinkExtractor
Link extractor using regular expressions to extract links found in text
documents. Relative links are resolved to the document URL.
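As a rough illustration of what resolving a relative link against the document URL means, the following sketch uses the standard `java.net.URI` class. The class name `RelativeLinkSketch` and the `resolve` helper are hypothetical and not part of the Norconex API; the extractor's own resolution logic may differ in its details.

```java
import java.net.URI;

/** Illustrative helper only (hypothetical, not part of the Norconex API). */
final class RelativeLinkSketch {

    /** Resolves a possibly relative link against the URL of the document it was found in. */
    static String resolve(String documentUrl, String link) {
        // URI.resolve leaves absolute links untouched and resolves
        // relative ones against the base URI.
        return URI.create(documentUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String docUrl = "http://www.example.com/docs/index.html";
        System.out.println(resolve(docUrl, "page2.html"));
        System.out.println(resolve(docUrl, "/about.html"));
        System.out.println(resolve(docUrl, "http://other.example.org/x"));
    }
}
```

A link relative to the document's folder, a link relative to the site root, and an already absolute link each end up as an absolute URL.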
For HTML documents, it is best to use HtmlLinkExtractor or DOMLinkExtractor, which address many cases specific to HTML.
By default, this extractor will extract URLs only in documents having their content type matching this regular expression:
text/.*
You can specify your own restrictions using AbstractLinkExtractor.setRestrictions(List), but make sure they represent text files.
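To get a feel for which content types the default expression `text/.*` accepts, here is a small sketch using `java.util.regex`. The class `ContentTypeSketch` and the method `isAccepted` are hypothetical names, and matching the whole content type string is an assumption; the extractor's actual restriction handling goes through its field and value matchers.

```java
import java.util.regex.Pattern;

/** Illustrative check only (hypothetical, not part of the Norconex API). */
final class ContentTypeSketch {

    // Same expression as the documented default content type pattern.
    static final Pattern TEXT_TYPES = Pattern.compile("text/.*");

    /** Returns true when the whole content type matches the default pattern (assumed full match). */
    static boolean isAccepted(String contentType) {
        return contentType != null && TEXT_TYPES.matcher(contentType).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAccepted("text/plain"));
        System.out.println(isAccepted("text/csv"));
        System.out.println(isAccepted("application/pdf"));
    }
}
```

Under that assumption, `text/plain` and `text/csv` pass while `application/pdf` does not, which is why custom restrictions should still describe text files.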
The following referrer information is stored as metadata in each document represented by the extracted URLs: HttpDocMetadata.REFERRER_REFERENCE.
By default, this extractor attempts to detect the encoding of a page when extracting links and referrer information. If no charset can be detected, it falls back to UTF-8. It is also possible to dictate which encoding to use with setCharset(String).
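The order of precedence described above (explicit charset first, then a detected one, then UTF-8) can be sketched as follows. This is a minimal illustration under stated assumptions: `CharsetFallbackSketch` and `choose` are hypothetical names, and the `detected` argument stands in for whatever the extractor's own detection step returns.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative fallback logic only (hypothetical, not part of the Norconex API). */
final class CharsetFallbackSketch {

    /**
     * Picks the charset to read a page with.
     * @param configured charset set explicitly, or null
     * @param detected charset name found by detection, or null if none (assumed input)
     */
    static Charset choose(String configured, String detected) {
        if (configured != null && !configured.trim().isEmpty()) {
            // An explicit setting wins over detection.
            return Charset.forName(configured.trim());
        }
        if (detected != null && !detected.trim().isEmpty()) {
            return Charset.forName(detected.trim());
        }
        // Nothing configured and nothing detected: fall back to UTF-8.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(choose(null, null));
        System.out.println(choose(null, "ISO-8859-1"));
        System.out.println(choose("UTF-16", "ISO-8859-1"));
    }
}
```

Note that `Charset.forName` throws an exception for names the JVM does not support, so a real configuration would need to guard against invalid values.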
<extractor
    class="com.norconex.collector.http.link.impl.RegexLinkExtractor"
    maxURLLength="(maximum URL length. Default is 2048)"
    charset="(supported character encoding)">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  <fieldMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (optional expression for fields used for links extraction instead
    of the document stream)
  </fieldMatcher>
  <!-- Patterns for URLs to extract -->
  <linkExtractionPatterns>
    <pattern>
      <match>(regular expression)</match>
      <replace>(optional regex replacement)</replace>
    </pattern>
    <!-- you can have multiple pattern entries -->
  </linkExtractionPatterns>
</extractor>
<extractor
    class="com.norconex.collector.http.link.impl.RegexLinkExtractor">
  <linkExtractionPatterns>
    <pattern>
      <match>\[(\d+)\]</match>
      <replace>http://www.example.com/page?id=$1</replace>
    </pattern>
  </linkExtractionPatterns>
</extractor>
The above example extracts page "ids" contained in square brackets and adds them to a custom URL.
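The match/replace pair from the example above can be tried out in isolation with plain `java.util.regex`. The sketch below is not the extractor's implementation; `PatternReplaceSketch` and `extract` are hypothetical names, and applying the replacement to each matched fragment is an assumption about how the two settings combine.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative match/replace logic only (hypothetical, not part of the Norconex API). */
final class PatternReplaceSketch {

    /** Finds every match in the text and turns it into a URL using the optional replacement. */
    static Set<String> extract(String text, String match, String replace) {
        Set<String> urls = new LinkedHashSet<>(); // keeps order, drops duplicates
        Pattern pattern = Pattern.compile(match);
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            String matched = matcher.group();
            // Without a replacement the matched text is used as is;
            // otherwise group references such as $1 are substituted.
            String url = (replace == null)
                    ? matched
                    : pattern.matcher(matched).replaceFirst(replace);
            urls.add(url);
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "See [12] and [345], then [12] again.";
        Set<String> urls = extract(
                text, "\\[(\\d+)\\]", "http://www.example.com/page?id=$1");
        urls.forEach(System.out::println);
    }
}
```

For the sample text, this yields two URLs, one ending in `id=12` and one ending in `id=345`; the repeated `[12]` is collapsed because the results are collected in a set.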
Modifier and Type | Field and Description |
---|---|
static String | DEFAULT_CONTENT_TYPE_PATTERN |
static int | DEFAULT_MAX_URL_LENGTH: Default maximum length a URL can have. |
static int | MAX_BUFFER_SIZE |
static int | OVERLAP_SIZE |
Constructor and Description |
---|
RegexLinkExtractor() |
Modifier and Type | Method and Description |
---|---|
void | addPattern(String pattern) |
void | addPattern(String pattern, String replacement): Adds a URL pattern, with an optional replacement. |
void | clearPatterns() |
boolean | equals(Object other) |
void | extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) |
String | getCharset(): Gets the character set of pages on which link extraction is performed. |
int | getMaxURLLength(): Gets the maximum supported URL length. |
String | getPatternReplacement(String pattern): Gets a pattern replacement. |
List<String> | getPatterns() |
int | hashCode() |
protected void | loadTextLinkExtractorFromXML(XML xml): Loads configuration settings specific to the implementing class. |
protected void | saveTextLinkExtractorToXML(XML xml): Saves configuration settings specific to the implementing class. |
void | setCharset(String charset): Sets the character set of pages on which link extraction is performed. |
void | setMaxURLLength(int maxURLLength): Sets the maximum supported URL length. |
String | toString() |
Methods inherited from class AbstractTextLinkExtractor: extractLinks, getFieldMatcher, loadLinkExtractorFromXML, saveLinkExtractorToXML, setFieldMatcher
Methods inherited from class AbstractLinkExtractor: addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
public static final String DEFAULT_CONTENT_TYPE_PATTERN
public static final int MAX_BUFFER_SIZE
public static final int OVERLAP_SIZE
public static final int DEFAULT_MAX_URL_LENGTH
public void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) throws IOException
extractTextLinks in class AbstractTextLinkExtractor
Throws: IOException
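public int getMaxURLLength()
Gets the maximum supported URL length.
public void setMaxURLLength(int maxURLLength)
Sets the maximum supported URL length.
Parameters: maxURLLength - maximum URL length
public String getCharset()
Gets the character set of pages on which link extraction is performed. May be null (charset detection will be attempted).
public void setCharset(String charset)
Sets the character set of pages on which link extraction is performed. Not specifying any (null) will attempt charset detection.
Parameters: charset - character set to use, or null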
public String getPatternReplacement(String pattern)
Gets a pattern replacement.
Parameters: pattern - the pattern for which to obtain its replacement
Returns: the pattern replacement, or null (no replacement)
public void clearPatterns()
public void addPattern(String pattern)
public void addPattern(String pattern, String replacement)
Adds a URL pattern, with an optional replacement.
Parameters: pattern - a regular expression
replacement - a regular expression replacement
protected void loadTextLinkExtractorFromXML(XML xml)
Description copied from class AbstractTextLinkExtractor: Loads configuration settings specific to the implementing class.
loadTextLinkExtractorFromXML in class AbstractTextLinkExtractor
Parameters: xml - XML configuration
protected void saveTextLinkExtractorToXML(XML xml)
Description copied from class AbstractTextLinkExtractor: Saves configuration settings specific to the implementing class.
saveTextLinkExtractorToXML in class AbstractTextLinkExtractor
Parameters: xml - the XML
public boolean equals(Object other)
equals in class AbstractTextLinkExtractor
public int hashCode()
hashCode in class AbstractTextLinkExtractor
public String toString()
toString in class AbstractTextLinkExtractor
Copyright © 2009–2023 Norconex Inc. All rights reserved.