public class HtmlLinkExtractor extends AbstractTextLinkExtractor
HTML link extractor for URLs found in HTML and possibly other text files.
This link extractor uses regular expressions to extract links. It does
so one chunk of text at a time, so that large files are not fully loaded
into memory. If you prefer a more flexible implementation that loads the
DOM model in memory to perform link extraction, consider using
DOMLinkExtractor.
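The chunk-at-a-time approach described above can be illustrated with a minimal JDK-only sketch. It is not the extractor's actual implementation: the class name ChunkedLinkSketch, its methods, and the simplified href pattern are all hypothetical, and the chunk and overlap sizes are arbitrary stand-ins for what MAX_BUFFER_SIZE and OVERLAP_SIZE presumably control. The idea is that each chunk is scanned together with the tail of the previous one, so a link that straddles a chunk boundary is still matched.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Norconex API.
final class ChunkedLinkSketch {
    // Simplified assumption: only double-quoted href attributes are matched.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Reads chunkSize characters at a time (chunkSize must be positive) and
    // prepends the last `overlap` characters of the previous window.
    static Set<String> extract(Reader reader, int chunkSize, int overlap)
            throws IOException {
        Set<String> links = new LinkedHashSet<>();
        char[] buffer = new char[chunkSize];
        String tail = "";
        int read;
        while ((read = reader.read(buffer)) != -1) {
            String window = tail + new String(buffer, 0, read);
            Matcher m = HREF.matcher(window);
            while (m.find()) {
                links.add(m.group(1)); // the Set removes re-matched links
            }
            int keep = Math.min(overlap, window.length());
            tail = window.substring(window.length() - keep);
        }
        return links;
    }

    // Convenience wrapper for in-memory text.
    static Set<String> extractFromString(String text, int chunkSize, int overlap) {
        try (Reader reader = new StringReader(text)) {
            return extract(reader, chunkSize, overlap);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<p>aaaa</p><a href=\"one.html\">1</a><a href=\"two.html\">2</a>";
        System.out.println(extractFromString(html, 20, 20)); // prints [one.html, two.html]
        System.out.println(extractFromString(html, 20, 0));  // prints []
    }
}
```

With no overlap, both links in this sample are cut by a chunk boundary and are missed. A link longer than the overlap can still be missed in this sketch, which is one reason to bound the supported URL length.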
By default, this extractor is applied only to documents matching one of these content types:
You can specify your own content types or other restrictions with
AbstractLinkExtractor.setRestrictions(List).
Make sure they represent a file with HTML-like markup tags containing URLs.
For documents that are just too different, consider implementing your own
ILinkExtractor instead.
Removing the default values and defining no content types causes the extractor
to try to extract URLs from all files (usually a bad idea).
a.href, frame.src, iframe.src, img.src, meta.http-equiv
You can specify your own set of tags and attributes to have different ones used for extracting URLs. For a more elaborate set, you can combine the above with your own list or use any of the following suggestions (tag.attribute):
applet.archive, applet.codebase, area.href, audio.src, base.href, blockquote.cite, body.background, button.formaction, command.icon, del.cite, embed.src, form.action, frame.longdesc, head.profile, html.manifest, iframe.longdesc, img.longdesc, img.usemap, input.formaction, input.src, input.usemap, ins.cite, link.href, object.archive, object.classid, object.codebase, object.data, object.usemap, q.cite, script.src, source.src, video.poster, video.src
meta.http-equiv is treated differently: a URL is extracted only if the
"http-equiv" value is "refresh" and a "content" attribute containing a URL
exists. "object" and "applet" can have multiple URLs.
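The meta refresh rule can be sketched with plain JDK regular expressions. This is only an illustration under stated assumptions: the class MetaRefreshSketch and its refreshUrl method are hypothetical names, and the pattern assumes the common "delay; url=target" form of the content value rather than whatever the extractor actually accepts.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Norconex API.
final class MetaRefreshSketch {
    // Assumed content format: optional delay, optional separator, then url=TARGET
    // where TARGET may be wrapped in single or double quotes.
    private static final Pattern REFRESH_URL = Pattern.compile(
            "^\\s*\\d*\\s*[;,]?\\s*url\\s*=\\s*['\"]?([^'\"\\s]+)['\"]?\\s*$",
            Pattern.CASE_INSENSITIVE);

    // Returns the target URL only when http-equiv is "refresh" and the
    // content value actually carries a URL.
    static Optional<String> refreshUrl(String httpEquiv, String content) {
        if (httpEquiv == null || content == null
                || !"refresh".equalsIgnoreCase(httpEquiv.trim())) {
            return Optional.empty();
        }
        Matcher m = REFRESH_URL.matcher(content);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(refreshUrl("refresh", "5; url=http://example.com/next.html"));
        System.out.println(refreshUrl("content-type", "text/html; charset=utf-8"));
    }
}
```

A refresh with a delay but no URL (for example a content value of "10") yields no link, matching the rule that both conditions must hold.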
Since 2.2.0, it is possible to identify a tag only as the holder of a URL (without attributes). The tag body value will be used as the URL.
Some "referrer" information is derived from each link and stored as
metadata in the document it points to.
It may vary for each link, but it is normally prefixed with
HttpDocMetadata.REFERRER_LINK_PREFIX.
Since 2.6.0, the referrer data is always stored (was optional before).
Since 2.4.0, this extractor will by default attempt to
detect the encoding of a page when extracting links and
referrer information. If no charset could be detected, it falls back to
UTF-8. It is also possible to dictate which encoding to use with
setCharset(String).
By default, a regular HTML link having the "rel" attribute set to "nofollow"
won't be extracted (e.g. <a href="x.html" rel="nofollow" ...>).
To force its extraction (and ensure it is followed), set
setIgnoreNofollow(boolean) to true.
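The nofollow decision can be pictured with a small JDK-only sketch. The class NofollowSketch and both of its methods are hypothetical, and the parsing is deliberately simplified (quoted rel attributes only); it shows the rule, not the extractor's actual code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Norconex API.
final class NofollowSketch {
    // Simplified assumption: the rel attribute value is quoted.
    private static final Pattern REL = Pattern.compile(
            "\\brel\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // True when the tag has a rel attribute listing the "nofollow" token.
    static boolean hasNofollow(String tag) {
        Matcher m = REL.matcher(tag);
        if (!m.find()) {
            return false;
        }
        for (String token : m.group(1).trim().split("\\s+")) {
            if ("nofollow".equalsIgnoreCase(token)) {
                return true;
            }
        }
        return false;
    }

    // A link is extracted unless it is nofollow and nofollow is being honored.
    static boolean shouldExtract(String tag, boolean ignoreNofollow) {
        return ignoreNofollow || !hasNofollow(tag);
    }

    public static void main(String[] args) {
        String tag = "<a href=\"x.html\" rel=\"nofollow\">";
        System.out.println(shouldExtract(tag, false)); // prints false
        System.out.println(shouldExtract(tag, true));  // prints true
    }
}
```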
Since 2.3.0, this extractor preserves hashtag characters (#) found
in URLs and every character after them. It relies on the implementation
of IURLNormalizer to strip them if need be.
GenericURLNormalizer is now always invoked by default, and the
default set of rules defined for it will remove fragments.
The URL specification says hashtags are used to represent fragments only, that is, to quickly jump to a specific section of the page the URL represents. Under normal circumstances, keeping URL fragments leads to duplicate documents being fetched (same URL but different fragment), so they should be stripped. Unfortunately, some sites do not follow the URL standard and use hashtags as a regular part of a URL (e.g., different hashtags point to different web pages). When crawling these sites, it may be essential to keep the URL fragments. This can be done by making sure the URL normalizer does not strip them.
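To make the duplicate problem concrete, here is a minimal JDK-only sketch. FragmentSketch and its methods are hypothetical names, and the stripping shown is a plain string operation standing in for whatever the configured URL normalizer does.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of the Norconex API.
final class FragmentSketch {
    // Drops the first '#' and everything after it.
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Collects distinct URLs, optionally keeping fragments as part of the URL.
    static Set<String> distinctUrls(List<String> urls, boolean keepFragments) {
        Set<String> distinct = new LinkedHashSet<>();
        for (String url : urls) {
            distinct.add(keepFragments ? url : stripFragment(url));
        }
        return distinct;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.com/page.html#intro",
                "http://example.com/page.html#details");
        System.out.println(distinctUrls(urls, false)); // prints [http://example.com/page.html]
        System.out.println(distinctUrls(urls, true).size()); // prints 2
    }
}
```

On a standards-following site the two sample URLs are the same page, so stripping avoids a duplicate fetch. On a site that uses the fragment for routing, stripping would collapse two distinct pages into one.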
By default, contextual information is kept about the HTML/XML mark-up
tag from which a link is extracted (e.g., tag name and attributes).
That information gets stored as metadata in the target document.
If you want to limit the quantity of information extracted/stored,
you can disable this feature by setting ignoreLinkData to true.
Since 2.4.0, only valid schemes are extracted for absolute URLs.
By default, those are http, https, and ftp. You can
specify your own list of supported protocols with
setSchemes(String[]).
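The scheme check can be sketched as follows, using only the JDK. SchemeFilterSketch, its constant, and its method are hypothetical names; the scheme pattern follows the generic URI syntax, and treating scheme-less (relative) URLs as unaffected is an assumption drawn from the statement that the rule applies to absolute URLs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Norconex API.
final class SchemeFilterSketch {
    // Defaults as listed in the documentation above.
    static final List<String> DEFAULT_SCHEMES = Arrays.asList("http", "https", "ftp");

    // Generic URI scheme syntax: a letter, then letters, digits, '+', '.', or '-'.
    private static final Pattern SCHEME =
            Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*):");

    // Absolute URLs must use one of the given schemes; URLs without a
    // scheme are assumed to be relative and are left alone.
    static boolean isAccepted(String url, List<String> schemes) {
        Matcher m = SCHEME.matcher(url);
        if (!m.find()) {
            return true;
        }
        return schemes.contains(m.group(1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isAccepted("https://example.com/", DEFAULT_SCHEMES));   // prints true
        System.out.println(isAccepted("mailto:joe@example.com", DEFAULT_SCHEMES)); // prints false
        System.out.println(isAccepted("/about.html", DEFAULT_SCHEMES));            // prints true
    }
}
```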
Since 2.6.0, URLs found in <!-- comments --> are no longer
extracted by default. To enable URL extraction from comments, use
setCommentsEnabled(boolean).
Since 2.8.0, you can identify portions of a document where links
should be extracted or ignored with
setExtractBetweens(RegexPair...) and setNoExtractBetweens(RegexPair...).
Eligible content for link extraction is identified first, and exclusions
are then applied to that subset.
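The two-step ordering (select eligible portions first, then exclude from that subset) can be sketched with JDK regular expressions. BetweenSketch and its methods are hypothetical names, and details such as whether the delimiters themselves are kept are assumptions of this sketch rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Norconex API.
final class BetweenSketch {
    // Pattern spanning from a start regex to the nearest following end regex.
    static Pattern between(String start, String end, boolean caseSensitive) {
        int flags = Pattern.DOTALL | (caseSensitive ? 0 : Pattern.CASE_INSENSITIVE);
        return Pattern.compile("(?:" + start + ").*?(?:" + end + ")", flags);
    }

    // Step 1: keep only text matched by an "extract" pattern (all text if none).
    // Step 2: remove "no extract" matches from what step 1 kept.
    static String eligibleText(String text, List<Pattern> extract,
            List<Pattern> noExtract) {
        String eligible = text;
        if (!extract.isEmpty()) {
            StringBuilder kept = new StringBuilder();
            for (Pattern p : extract) {
                Matcher m = p.matcher(text);
                while (m.find()) {
                    kept.append(m.group()).append('\n');
                }
            }
            eligible = kept.toString();
        }
        for (Pattern p : noExtract) {
            eligible = p.matcher(eligible).replaceAll("");
        }
        return eligible;
    }

    public static void main(String[] args) {
        String html = "<a href=\"top.html\">t</a><!-- main --><a href=\"a.html\">a</a>"
                + "<nav><a href=\"n.html\">n</a></nav><!-- /main --><a href=\"foot.html\">f</a>";
        List<Pattern> extract = new ArrayList<>();
        extract.add(between("<!-- main -->", "<!-- /main -->", false));
        List<Pattern> noExtract = new ArrayList<>();
        noExtract.add(between("<nav>", "</nav>", false));
        System.out.println(eligibleText(html, extract, noExtract));
    }
}
```

In the sample, only the a.html link survives: top.html and foot.html fall outside the eligible portion, and n.html is removed by the exclusion applied afterwards.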
Since 2.9.0, you can further limit link extraction to specific
areas of a document using selector-syntax, with
setExtractSelectors(String...) and setNoExtractSelectors(String...).
<extractor
    class="com.norconex.collector.http.link.impl.HtmlLinkExtractor"
    maxURLLength="(maximum URL length. Default is 2048)"
    ignoreNofollow="[false|true]"
    ignoreLinkData="[false|true]"
    commentsEnabled="[false|true]"
    charset="(supported character encoding)">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  <fieldMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (optional expression for fields used for links extraction instead
    of the document stream)
  </fieldMatcher>
  <schemes>
    (CSV list of URI scheme for which to perform link extraction.
    leave blank or remove tag to use defaults.)
  </schemes>
  <!-- Which tags and attributes hold the URLs to extract. -->
  <tags>
    <tag
        name="(tag name)"
        attribute="(tag attribute)"/>
    <!-- you can have multiple tag entries -->
  </tags>
  <!-- Only extract URLs from the following text portions. -->
  <extractBetween
      caseSensitive="[false|true]">
    <start>(regex)</start>
    <end>(regex)</end>
  </extractBetween>
  <!-- you can have multiple extractBetween entries -->
  <!-- Do not extract URLs from the following text portions. -->
  <noExtractBetween
      caseSensitive="[false|true]">
    <start>(regex)</start>
    <end>(regex)</end>
  </noExtractBetween>
  <!-- you can have multiple noExtractBetween entries -->
  <!-- Only extract URLs matching the following selectors. -->
  <extractSelector>(selector)</extractSelector>
  <!-- you can have multiple extractSelector entries -->
  <!-- Do not extract URLs matching the following selectors. -->
  <noExtractSelector>(selector)</noExtractSelector>
  <!-- you can have multiple noExtractSelector entries -->
</extractor>
<extractor
    class="com.norconex.collector.http.link.impl.HtmlLinkExtractor">
  <tags>
    <tag name="a" attribute="href"/>
    <tag name="frame" attribute="src"/>
    <tag name="iframe" attribute="src"/>
    <tag name="img" attribute="src"/>
    <tag name="meta" attribute="http-equiv"/>
    <tag name="script" attribute="src"/>
  </tags>
</extractor>
The above example adds the URLs of JavaScript files to the list of URLs to be extracted.
Modifier and Type | Class and Description |
---|---|
static class | HtmlLinkExtractor.RegexPair |
Modifier and Type | Field and Description |
---|---|
static int | DEFAULT_MAX_URL_LENGTH Default maximum length a URL can have. |
static int | MAX_BUFFER_SIZE |
static int | OVERLAP_SIZE |
Constructor and Description |
---|
HtmlLinkExtractor() |
Modifier and Type | Method and Description |
---|---|
void | addExtractBetween(String start, String end, boolean caseSensitive) Adds patterns delimiting a portion of a document to be considered for link extraction. |
void | addExtractSelectors(List<String> selectors) Adds selectors matching the portions of a document to be considered for link extraction. |
void | addExtractSelectors(String... selectors) Adds selectors matching the portions of a document to be considered for link extraction. |
void | addLinkTag(String tagName, String attribute) |
void | addNoExtractBetween(String start, String end, boolean caseSensitive) Adds patterns delimiting a portion of a document to be excluded from link extraction. |
void | addNoExtractSelectors(List<String> selectors) Adds selectors matching the portions of a document to be excluded from link extraction. |
void | addNoExtractSelectors(String... selectors) Adds selectors matching the portions of a document to be excluded from link extraction. |
void | clearLinkTags() |
boolean | equals(Object other) |
void | extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) |
String | getCharset() Gets the character set of pages on which link extraction is performed. |
List<HtmlLinkExtractor.RegexPair> | getExtractBetweens() Gets the patterns delimiting the portions of a document to be considered for link extraction. |
List<String> | getExtractSelectors() Gets the selectors matching the portions of a document to be considered for link extraction. |
int | getMaxURLLength() Gets the maximum supported URL length. |
List<HtmlLinkExtractor.RegexPair> | getNoExtractBetweens() Gets the patterns delimiting the portions of a document to be excluded from link extraction. |
List<String> | getNoExtractSelectors() Gets the selectors matching the portions of a document to be excluded from link extraction. |
List<String> | getSchemes() Gets the schemes to be extracted. |
int | hashCode() |
boolean | isCommentsEnabled() Gets whether links should be extracted from HTML/XML comments. |
boolean | isIgnoreLinkData() Gets whether to ignore extra data associated with a link. |
boolean | isIgnoreNofollow() |
protected void | loadTextLinkExtractorFromXML(XML xml) Loads configuration settings specific to the implementing class. |
void | removeLinkTag(String tagName, String attribute) |
protected void | saveTextLinkExtractorToXML(XML xml) Saves configuration settings specific to the implementing class. |
void | setCharset(String charset) Sets the character set of pages on which link extraction is performed. |
void | setCommentsEnabled(boolean commentsEnabled) Sets whether links should be extracted from HTML/XML comments. |
void | setExtractBetweens(HtmlLinkExtractor.RegexPair... betweens) Sets the patterns delimiting the portions of a document to be considered for link extraction. |
void | setExtractBetweens(List<HtmlLinkExtractor.RegexPair> betweens) Sets the patterns delimiting the portions of a document to be considered for link extraction. |
void | setExtractSelectors(List<String> selectors) Sets the selectors matching the portions of a document to be considered for link extraction. |
void | setExtractSelectors(String... selectors) Sets the selectors matching the portions of a document to be considered for link extraction. |
void | setIgnoreLinkData(boolean ignoreLinkData) Sets whether to ignore extra data associated with a link. |
void | setIgnoreNofollow(boolean ignoreNofollow) |
void | setMaxURLLength(int maxURLLength) Sets the maximum supported URL length. |
void | setNoExtractBetweens(HtmlLinkExtractor.RegexPair... betweens) Sets the patterns delimiting the portions of a document to be excluded from link extraction. |
void | setNoExtractBetweens(List<HtmlLinkExtractor.RegexPair> betweens) Sets the patterns delimiting the portions of a document to be excluded from link extraction. |
void | setNoExtractSelectors(List<String> selectors) Sets the selectors matching the portions of a document to be excluded from link extraction. |
void | setNoExtractSelectors(String... selectors) Sets the selectors matching the portions of a document to be excluded from link extraction. |
void | setSchemes(List<String> schemes) Sets the schemes to be extracted. |
void | setSchemes(String... schemes) Sets the schemes to be extracted. |
String | toString() |
extractLinks, getFieldMatcher, loadLinkExtractorFromXML, saveLinkExtractorToXML, setFieldMatcher
addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
public static final int MAX_BUFFER_SIZE
public static final int OVERLAP_SIZE
public static final int DEFAULT_MAX_URL_LENGTH
public void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) throws IOException
Specified by: extractTextLinks in class AbstractTextLinkExtractor
Throws: IOException
public int getMaxURLLength()
public void setMaxURLLength(int maxURLLength)
Parameters: maxURLLength - maximum URL length
public List<HtmlLinkExtractor.RegexPair> getExtractBetweens()
public void setExtractBetweens(HtmlLinkExtractor.RegexPair... betweens)
Parameters: betweens - extract between patterns
public void setExtractBetweens(List<HtmlLinkExtractor.RegexPair> betweens)
Parameters: betweens - extract between patterns
public void addExtractBetween(String start, String end, boolean caseSensitive)
Parameters:
start - pattern matching start of text portion
end - pattern matching end of text portion
caseSensitive - whether the patterns are case sensitive or not
public List<HtmlLinkExtractor.RegexPair> getNoExtractBetweens()
public void setNoExtractBetweens(HtmlLinkExtractor.RegexPair... betweens)
Parameters: betweens - no-extract between patterns
public void setNoExtractBetweens(List<HtmlLinkExtractor.RegexPair> betweens)
Parameters: betweens - no-extract between patterns
public void addNoExtractBetween(String start, String end, boolean caseSensitive)
Parameters:
start - pattern matching start of text portion
end - pattern matching end of text portion
caseSensitive - whether the patterns are case sensitive or not
public List<String> getExtractSelectors()
public void setExtractSelectors(String... selectors)
Parameters: selectors - selectors
public void setExtractSelectors(List<String> selectors)
Parameters: selectors - selectors
public void addExtractSelectors(String... selectors)
Parameters: selectors - selectors
public void addExtractSelectors(List<String> selectors)
Parameters: selectors - selectors
public List<String> getNoExtractSelectors()
public void setNoExtractSelectors(String... selectors)
Parameters: selectors - selectors
public void setNoExtractSelectors(List<String> selectors)
Parameters: selectors - selectors
public void addNoExtractSelectors(String... selectors)
Parameters: selectors - selectors
public void addNoExtractSelectors(List<String> selectors)
Parameters: selectors - selectors
public boolean isCommentsEnabled()
Returns: true if links should be extracted from comments.
public void setCommentsEnabled(boolean commentsEnabled)
Parameters: commentsEnabled - true if links should be extracted from comments.
public List<String> getSchemes()
public void setSchemes(String... schemes)
Parameters: schemes - schemes to be extracted
public void setSchemes(List<String> schemes)
Parameters: schemes - schemes to be extracted
public boolean isIgnoreNofollow()
public void setIgnoreNofollow(boolean ignoreNofollow)
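public boolean isIgnoreLinkData()
Returns: true to ignore.
public void setIgnoreLinkData(boolean ignoreLinkData)
Parameters: ignoreLinkData - true to ignore.
public String getCharset()
Gets the character set of pages on which link extraction is performed. If null, charset detection will be attempted.
Returns: the character set, or null
public void setCharset(String charset)
Sets the character set of pages on which link extraction is performed. A null value will attempt charset detection.
Parameters: charset - character set to use, or null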
public void clearLinkTags()
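protected void loadTextLinkExtractorFromXML(XML xml)
Description copied from class: AbstractTextLinkExtractor
Loads configuration settings specific to the implementing class.
Specified by: loadTextLinkExtractorFromXML in class AbstractTextLinkExtractor
Parameters: xml - XML configuration
protected void saveTextLinkExtractorToXML(XML xml)
Description copied from class: AbstractTextLinkExtractor
Saves configuration settings specific to the implementing class.
Specified by: saveTextLinkExtractorToXML in class AbstractTextLinkExtractor
Parameters: xml - the XML
public boolean equals(Object other)
Overrides: equals in class AbstractTextLinkExtractor
public int hashCode()
Overrides: hashCode in class AbstractTextLinkExtractor
public String toString()
Overrides: toString in class AbstractTextLinkExtractor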
Copyright © 2009–2023 Norconex Inc. All rights reserved.