public class GenericLinkExtractor extends Object implements ILinkExtractor, IXMLConfigurable
By default, link extraction is performed on documents with one of these content types:
text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp
You can specify your own content types if you know they represent a file with HTML-like markup tags containing URLs. For documents that are just too different, consider implementing your own ILinkExtractor instead. Removing the default values without defining any content types makes the extractor attempt URL extraction on all files (usually a bad idea).
By default, URLs are extracted from the following tags and attributes (tag.attribute):
a.href, frame.src, iframe.src, img.src, meta.http-equiv
You can specify your own set of tags and attributes to have different ones used for extracting URLs. For a more elaborate set, you can combine the above with your own list or use any of the following suggestions (tag.attribute):
applet.archive, applet.codebase, area.href, audio.src, base.href, blockquote.cite, body.background, button.formaction, command.icon, del.cite, embed.src, form.action, frame.longdesc, head.profile, html.manifest, iframe.longdesc, img.longdesc, img.usemap, input.formaction, input.src, input.usemap, ins.cite, link.href, object.archive, object.classid, object.codebase, object.data, object.usemap, q.cite, script.src, source.src, video.poster, video.src
The meta.http-equiv entry is treated differently: a URL is extracted only if the "http-equiv" value is "refresh" and a "content" attribute holding a URL exists. The "object" and "applet" tags can have multiple URLs.
Since 2.2.0, it is possible to identify a tag only as the holder of a URL (without attributes). The tag body value will be used as the URL.
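The special handling of meta.http-equiv can be illustrated with a minimal, standalone sketch. It is not the extractor's actual code: the class name MetaRefreshSketch, the method refreshUrl and the regular expression are assumptions made for illustration, and they only cover the common "delay; url=..." form of the "content" attribute.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Norconex API.
final class MetaRefreshSketch {

    // Matches a refresh "content" value such as "5; url=next.html"
    // and captures the URL part (optionally quoted).
    private static final Pattern REFRESH_URL = Pattern.compile(
            "^\\s*\\d*\\s*[;,]\\s*url\\s*=\\s*['\"]?([^'\"]+?)['\"]?\\s*$",
            Pattern.CASE_INSENSITIVE);

    /** Returns the URL only when http-equiv is "refresh" and content holds one. */
    static Optional<String> refreshUrl(String httpEquiv, String content) {
        if (httpEquiv == null || content == null
                || !"refresh".equalsIgnoreCase(httpEquiv.trim())) {
            return Optional.empty();
        }
        Matcher m = REFRESH_URL.matcher(content);
        return m.matches() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(refreshUrl("refresh", "5; url=https://example.com/next.html"));
        System.out.println(refreshUrl("refresh", "30"));
        System.out.println(refreshUrl("content-type", "text/html; charset=UTF-8"));
    }
}
```

A refresh with only a delay, or a meta tag with any other http-equiv value, yields no URL in this sketch.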
The following referrer information is stored as metadata in each document represented by the extracted URLs:
Referrer reference: the reference (URL) of the referrer document. Metadata value is HttpMetadata.COLLECTOR_REFERRER_REFERENCE.
Referrer link tag: the tag that held the link in the referrer document. Metadata value is HttpMetadata.COLLECTOR_REFERRER_LINK_TAG.
Referrer link text: the text between the <a href=""></a> tags of the referrer document. Can be useful to help establish better document titles. Metadata value is HttpMetadata.COLLECTOR_REFERRER_LINK_TEXT.
Referrer link title: the title attribute of the link that contained the document reference (URL) in the referrer's content. Can also be useful to help establish better document titles. Metadata value is HttpMetadata.COLLECTOR_REFERRER_LINK_TITLE.
Since 2.6.0, the referrer data is always stored (was optional before).
Since 2.4.0, this extractor will by default attempt to detect the encoding of a page when extracting links and referrer information. If no charset can be detected, it falls back to UTF-8. It is also possible to dictate which encoding to use with setCharset(String).
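The order of precedence described above (explicit charset, then detection, then UTF-8) can be sketched as follows. This is an illustration only: how the extractor really detects encodings is not described here, so the sketch substitutes a simple lookup of a charset declared in a meta tag. The names CharsetFallbackSketch, lookup and resolve are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Norconex API.
final class CharsetFallbackSketch {

    // Finds a charset declared in a <meta> tag, e.g. <meta charset="ISO-8859-1">.
    // Stand-in for real detection, which is an assumption of this sketch.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:+-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the named charset, or null when the name is missing, illegal or unsupported. */
    static Charset lookup(String name) {
        try {
            return name != null && Charset.isSupported(name) ? Charset.forName(name) : null;
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    /** Explicit charset first, then the declared one, then UTF-8. */
    static Charset resolve(String forcedCharset, byte[] pageStart) {
        Charset cs = lookup(forcedCharset);
        if (cs != null) {
            return cs;
        }
        // ISO-8859-1 maps every byte to a char, which is enough to read ASCII markup.
        String head = new String(pageStart, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            cs = lookup(m.group(1));
            if (cs != null) {
                return cs;
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"ISO-8859-1\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(resolve(null, page));
        System.out.println(resolve("UTF-16", page));
        System.out.println(resolve(null, new byte[0]));
    }
}
```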
By default, a regular HTML link having the "rel" attribute set to "nofollow" won't be extracted (e.g. <a href="x.html" rel="nofollow" ...>). To force its extraction (and ensure it is followed), set setIgnoreNofollow(boolean) to true.
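The decision described above can be sketched as a small predicate. This is a simplified illustration, not the extractor's code: the class NofollowSketch and its method are hypothetical, and treating "rel" as a space-separated, case-insensitive token list is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of the Norconex API.
final class NofollowSketch {

    /**
     * Decides whether a link should be extracted given its "rel" value
     * and the ignoreNofollow setting.
     */
    static boolean shouldExtract(String rel, boolean ignoreNofollow) {
        if (ignoreNofollow || rel == null) {
            return true;
        }
        // Assumption: "rel" may hold several space-separated tokens.
        return Arrays.stream(rel.trim().toLowerCase(Locale.ROOT).split("\\s+"))
                .noneMatch("nofollow"::equals);
    }

    public static void main(String[] args) {
        System.out.println(shouldExtract("nofollow", false));
        System.out.println(shouldExtract("nofollow", true));
        System.out.println(shouldExtract(null, false));
    }
}
```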
Since 2.6.0, it is possible to treat all the links in certain pages as "nofollow" links. Link extraction is essentially skipped for URLs matching the patterns set in setNofollowPatterns(List).
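As a rough illustration of page-level "nofollow" patterns, the sketch below checks a page URL against a list of regular expressions. The class NofollowPatternsSketch is hypothetical, and requiring a full match of the URL (rather than a partial one) is an assumption; the text above does not say which the extractor uses.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the Norconex API.
final class NofollowPatternsSketch {

    private final List<Pattern> patterns;

    NofollowPatternsSketch(List<String> regexes) {
        this.patterns = regexes.stream()
                .map(Pattern::compile)
                .collect(Collectors.toList());
    }

    /** True when link extraction would be skipped for the given page URL. */
    boolean skipsExtraction(String pageUrl) {
        // Assumption: the whole URL must match one of the patterns.
        return patterns.stream().anyMatch(p -> p.matcher(pageUrl).matches());
    }

    public static void main(String[] args) {
        NofollowPatternsSketch sketch = new NofollowPatternsSketch(
                Arrays.asList("https?://example\\.com/archive/.*"));
        System.out.println(sketch.skipsExtraction("https://example.com/archive/2020/index.html"));
        System.out.println(sketch.skipsExtraction("https://example.com/blog/index.html"));
    }
}
```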
Since 2.3.0, this extractor preserves hashtag characters (#) found in URLs and every character after them. It relies on the implementation of IURLNormalizer to strip them if need be. GenericURLNormalizer is now always invoked by default, and the default set of rules defined for it will remove fragments.
The URL specification says hashtags are used to represent fragments only, that is, to quickly jump to a specific section of the page the URL represents. Under normal circumstances, keeping URL fragments leads to duplicate documents being fetched (same URL but different fragment), so they should be stripped. Unfortunately, some sites do not follow the URL standard and use hashtags as a regular part of a URL (i.e., different hashtags point to different web pages). It may be essential when crawling these sites to keep the URL fragments. This can be done by making sure the URL normalizer does not strip them.
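To make the duplicate problem concrete, here is a small standalone sketch showing how two links to the same page collapse into one once fragments are stripped, and stay distinct when they are kept. The class FragmentSketch and its methods are hypothetical and unrelated to the actual IURLNormalizer implementations.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the Norconex API.
final class FragmentSketch {

    /** Removes the first '#' and everything after it, if any. */
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    /** Returns the distinct URLs a crawler would queue, with or without fragments. */
    static Set<String> distinctUrls(List<String> urls, boolean keepFragments) {
        Set<String> out = new LinkedHashSet<>();
        for (String url : urls) {
            out.add(keepFragments ? url : stripFragment(url));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "http://example.com/doc.html#intro",
                "http://example.com/doc.html#usage");
        System.out.println(distinctUrls(links, false));
        System.out.println(distinctUrls(links, true));
    }
}
```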
Since 2.4.0, only valid schemes are extracted for absolute URLs. By default, those are http, https, and ftp. You can specify your own list of supported protocols with setSchemes(String[]).
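The scheme check can be pictured with the following sketch, which uses java.net.URI to read the scheme of a candidate URL. It is an illustration under assumptions: the class SchemeFilterSketch is hypothetical, relative URLs are let through because the restriction is stated for absolute URLs only, and URLs that java.net.URI cannot parse are simply rejected.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the Norconex API.
final class SchemeFilterSketch {

    private final Set<String> schemes = new HashSet<>();

    SchemeFilterSketch(String... allowedSchemes) {
        for (String scheme : allowedSchemes) {
            schemes.add(scheme.toLowerCase(Locale.ROOT));
        }
    }

    /** True when the URL is relative or uses one of the allowed schemes. */
    boolean isExtractable(String url) {
        try {
            String scheme = new URI(url).getScheme();
            // Relative URLs carry no scheme and are not affected by the filter.
            return scheme == null || schemes.contains(scheme.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            return false; // Assumption: unparsable URLs are skipped.
        }
    }

    public static void main(String[] args) {
        SchemeFilterSketch filter = new SchemeFilterSketch("http", "https", "ftp");
        System.out.println(filter.isExtractable("https://example.com/page.html"));
        System.out.println(filter.isExtractable("mailto:someone@example.com"));
        System.out.println(filter.isExtractable("/about.html"));
    }
}
```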
Since 2.6.0, URLs found in <!-- comments --> are no longer extracted by default. To enable URL extraction from comments, use setCommentsEnabled(boolean).
Since 2.8.0, you can identify portions of a document where links should be extracted or ignored with setExtractBetweens(RegexPair...) and setNoExtractBetweens(RegexPair...). Content eligible for link extraction is identified first, and exclusions are then applied to that subset.
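The two-step order (select eligible content first, then exclude from that subset) can be sketched with plain regular expressions. This is a simplified illustration, not the extractor's implementation: the class ExtractBetweenSketch and its methods are hypothetical, the caseSensitive option is left out, and the start/end markers are assumed to match non-empty text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Norconex API.
final class ExtractBetweenSketch {

    /** Returns the text found between each start/end match (markers excluded). */
    static List<String> between(String text, String startRegex, String endRegex) {
        List<String> parts = new ArrayList<>();
        Matcher start = Pattern.compile(startRegex).matcher(text);
        Matcher end = Pattern.compile(endRegex).matcher(text);
        int pos = 0;
        while (start.find(pos) && end.find(start.end())) {
            parts.add(text.substring(start.end(), end.start()));
            if (end.end() == pos) {
                break; // guard against patterns matching empty text
            }
            pos = end.end();
        }
        return parts;
    }

    /** Removes every start...end span (markers included) from the text. */
    static String strip(String text, String startRegex, String endRegex) {
        StringBuilder out = new StringBuilder();
        Matcher start = Pattern.compile(startRegex).matcher(text);
        Matcher end = Pattern.compile(endRegex).matcher(text);
        int pos = 0;
        while (start.find(pos) && end.find(start.end())) {
            out.append(text, pos, start.start());
            if (end.end() == pos) {
                break; // guard against patterns matching empty text
            }
            pos = end.end();
        }
        return out.append(text.substring(pos)).toString();
    }

    /** Step 1: keep only the eligible portions. Step 2: exclude within that subset. */
    static String eligible(String text, String start, String end,
            String noStart, String noEnd) {
        StringBuilder out = new StringBuilder();
        for (String part : between(text, start, end)) {
            out.append(strip(part, noStart, noEnd)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"/home\">Home</a><!-- BEGIN --><a href=\"/a\">A</a>"
                + "<!-- SKIP --><a href=\"/ad\">Ad</a><!-- /SKIP --><a href=\"/b\">B</a>"
                + "<!-- END --><a href=\"/footer\">Footer</a>";
        System.out.println(eligible(html,
                "<!-- BEGIN -->", "<!-- END -->", "<!-- SKIP -->", "<!-- /SKIP -->"));
    }
}
```

Only the links located between the BEGIN and END markers, and outside the SKIP markers, remain candidates for extraction in this sketch.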
Since 2.9.0, you can further limit link extraction to specific areas by using selector-syntax, with setExtractSelectors(String...) and setNoExtractSelectors(String...).
<extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor"
    maxURLLength="(maximum URL length. Default is 2048)"
    ignoreNofollow="[false|true]"
    commentsEnabled="[false|true]"
    charset="(supported character encoding)" >
  <contentTypes>
    (CSV list of content types on which to perform link extraction.
     leave blank or remove tag to use defaults.)
  </contentTypes>
  <schemes>
    (CSV list of URI scheme for which to perform link extraction.
     leave blank or remove tag to use defaults.)
  </schemes>
  <!-- Which tags and attributes hold the URLs to extract. -->
  <tags>
    <tag name="(tag name)" attribute="(tag attribute)" />
    <!-- you can have multiple tag entries -->
  </tags>
  <!-- Only extract URLs from the following text portions. -->
  <extractBetween caseSensitive="[false|true]">
    <start>(regex)</start>
    <end>(regex)</end>
  </extractBetween>
  <!-- you can have multiple extractBetween entries -->
  <!-- Do not extract URLs from the following text portions. -->
  <noExtractBetween caseSensitive="[false|true]">
    <start>(regex)</start>
    <end>(regex)</end>
  </noExtractBetween>
  <!-- you can have multiple noExtractBetween entries -->
  <!-- Only extract URLs matching the following selectors. -->
  <extractSelector>(selector)</extractSelector>
  <!-- you can have multiple extractSelector entries -->
  <!-- Do not extract URLs matching the following selectors. -->
  <noExtractSelector>(selector)</noExtractSelector>
  <!-- you can have multiple noExtractSelector entries -->
</extractor>
The following adds URLs to JavaScript files to the list of URLs to be extracted.
<extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
  <tags>
    <tag name="a" attribute="href" />
    <tag name="frame" attribute="src" />
    <tag name="iframe" attribute="src" />
    <tag name="img" attribute="src" />
    <tag name="meta" attribute="http-equiv" />
    <tag name="script" attribute="src" />
  </tags>
</extractor>
Modifier and Type | Class and Description |
---|---|
static class | GenericLinkExtractor.RegexPair |

Modifier and Type | Field and Description |
---|---|
static int | DEFAULT_MAX_URL_LENGTH: Default maximum length a URL can have. |
static int | MAX_BUFFER_SIZE |
static int | OVERLAP_SIZE |

Constructor and Description |
---|
GenericLinkExtractor() |

Modifier and Type | Method and Description |
---|---|
boolean | accepts(String url, ContentType contentType): Whether this link extraction should be executed for the given URL and/or content type. |
void | addExtractBetween(String start, String end, boolean caseSensitive): Adds patterns delimiting a portion of a document to be considered for link extraction. |
void | addExtractSelectors(String... selectors): Adds selectors matching the portions of a document to be considered for link extraction. |
void | addLinkTag(String tagName, String attribute) |
void | addNoExtractBetween(String start, String end, boolean caseSensitive): Adds patterns delimiting a portion of a document to be excluded from link extraction. |
void | addNoExtractSelectors(String... selectors): Adds selectors matching the portions of a document to be excluded from link extraction. |
void | addNofollowPatterns(String regex): Adds a pattern for references for which link extraction is disabled. |
void | clearLinkTags() |
boolean | equals(Object other) |
Set<Link> | extractLinks(InputStream input, String reference, ContentType contentType): Extracts links from a document. |
String | getCharset(): Gets the character set of pages on which link extraction is performed. |
ContentType[] | getContentTypes() |
GenericLinkExtractor.RegexPair[] | getExtractBetweens(): Gets the patterns delimiting the portions of a document to be considered for link extraction. |
String[] | getExtractSelectors(): Gets the selectors matching the portions of a document to be considered for link extraction. |
int | getMaxURLLength(): Gets the maximum supported URL length. |
GenericLinkExtractor.RegexPair[] | getNoExtractBetweens(): Gets the patterns delimiting the portions of a document to be excluded from link extraction. |
String[] | getNoExtractSelectors(): Gets the selectors matching the portions of a document to be excluded from link extraction. |
List<String> | getNofollowPatterns(): Gets the patterns of references for which link extraction is disabled. |
String[] | getSchemes(): Gets the schemes to be extracted. |
int | hashCode() |
boolean | isCommentsEnabled(): Gets whether links should be extracted from HTML/XML comments. |
boolean | isIgnoreNofollow() |
boolean | isKeepReferrerData(): Deprecated. Since 2.6.0, referrer data is always kept. |
void | loadFromXML(Reader in) |
void | removeLinkTag(String tagName, String attribute) |
void | saveToXML(Writer out) |
void | setCharset(String charset): Sets the character set of pages on which link extraction is performed. |
void | setCommentsEnabled(boolean commentsEnabled): Sets whether links should be extracted from HTML/XML comments. |
void | setContentTypes(ContentType... contentTypes) |
void | setExtractBetweens(GenericLinkExtractor.RegexPair... betweens): Sets the patterns delimiting the portions of a document to be considered for link extraction. |
void | setExtractSelectors(String... selectors): Sets the selectors matching the portions of a document to be considered for link extraction. |
void | setIgnoreNofollow(boolean ignoreNofollow) |
void | setKeepReferrerData(boolean keepReferrerData): Deprecated. Since 2.6.0, referrer data is always kept. |
void | setMaxURLLength(int maxURLLength): Sets the maximum supported URL length. |
void | setNoExtractBetweens(GenericLinkExtractor.RegexPair... betweens): Sets the patterns delimiting the portions of a document to be excluded from link extraction. |
void | setNoExtractSelectors(String... selectors): Sets the selectors matching the portions of a document to be excluded from link extraction. |
void | setNofollowPatterns(List<String> patterns): Sets the patterns of references for which link extraction is disabled. |
void | setSchemes(String... schemes): Sets the schemes to be extracted. |
String | toString() |

public static final int MAX_BUFFER_SIZE
public static final int OVERLAP_SIZE
public static final int DEFAULT_MAX_URL_LENGTH
public Set<Link> extractLinks(InputStream input, String reference, ContentType contentType) throws IOException
Specified by: extractLinks in interface ILinkExtractor
Parameters:
input - the document input stream
reference - document reference (URL)
contentType - the document content type
Throws:
IOException - problem reading the document

public boolean accepts(String url, ContentType contentType)
Specified by: accepts in interface ILinkExtractor
Parameters:
url - the url
contentType - the content type
Returns:
true if the given URL is accepted

public int getMaxURLLength()

public void setMaxURLLength(int maxURLLength)
Parameters:
maxURLLength - maximum URL length

public ContentType[] getContentTypes()

public void setContentTypes(ContentType... contentTypes)

public GenericLinkExtractor.RegexPair[] getExtractBetweens()

public void setExtractBetweens(GenericLinkExtractor.RegexPair... betweens)
Parameters:
betweens - extract between patterns

public void addExtractBetween(String start, String end, boolean caseSensitive)
Parameters:
start - pattern matching start of text portion
end - pattern matching end of text portion
caseSensitive - whether the patterns are case sensitive or not

public GenericLinkExtractor.RegexPair[] getNoExtractBetweens()

public void setNoExtractBetweens(GenericLinkExtractor.RegexPair... betweens)
Parameters:
betweens - no-extract between patterns

public void addNoExtractBetween(String start, String end, boolean caseSensitive)
Parameters:
start - pattern matching start of text portion
end - pattern matching end of text portion
caseSensitive - whether the patterns are case sensitive or not

public String[] getExtractSelectors()

public void setExtractSelectors(String... selectors)
Parameters:
selectors - selectors

public void addExtractSelectors(String... selectors)
Parameters:
selectors - selectors

public String[] getNoExtractSelectors()

public void setNoExtractSelectors(String... selectors)
Parameters:
selectors - selectors

public void addNoExtractSelectors(String... selectors)
Parameters:
selectors - selectors

public List<String> getNofollowPatterns()

public void setNofollowPatterns(List<String> patterns)
Parameters:
patterns - the list of regex URL patterns

public void addNofollowPatterns(String regex)
Parameters:
regex - the regex URL pattern

public boolean isCommentsEnabled()
Returns:
true if links should be extracted from comments.

public void setCommentsEnabled(boolean commentsEnabled)
Parameters:
commentsEnabled - true if links should be extracted from comments.

public String[] getSchemes()

public void setSchemes(String... schemes)
Parameters:
schemes - schemes to be extracted

public boolean isIgnoreNofollow()

public void setIgnoreNofollow(boolean ignoreNofollow)

@Deprecated public boolean isKeepReferrerData()
Deprecated. Since 2.6.0, referrer data is always kept.
Returns:
true

@Deprecated public void setKeepReferrerData(boolean keepReferrerData)
Deprecated. Since 2.6.0, referrer data is always kept.
Parameters:
keepReferrerData - referrer data

public String getCharset()
Gets the character set of pages on which link extraction is performed. Default is null (charset detection will be attempted).
Returns:
character set to use, or null

public void setCharset(String charset)
Sets the character set of pages on which link extraction is performed. Not specifying any (null) will attempt charset detection.
Parameters:
charset - character set to use, or null

public void clearLinkTags()

public void loadFromXML(Reader in)
Specified by: loadFromXML in interface IXMLConfigurable

public void saveToXML(Writer out) throws IOException
Specified by: saveToXML in interface IXMLConfigurable
Throws:
IOException
Copyright © 2009–2021 Norconex Inc. All rights reserved.