Class HtmlLinkExtractor
- java.lang.Object
- com.norconex.collector.http.link.AbstractLinkExtractor
- com.norconex.collector.http.link.AbstractTextLinkExtractor
- com.norconex.collector.http.link.impl.HtmlLinkExtractor
- All Implemented Interfaces:
ILinkExtractor, IXMLConfigurable
- Direct Known Subclasses:
GenericLinkExtractor
public class HtmlLinkExtractor extends AbstractTextLinkExtractor
HTML link extractor for URLs found in HTML and possibly other text files.
This link extractor uses regular expressions to extract links. It processes one chunk of text at a time, so that large files are not fully loaded into memory. If you prefer a more flexible implementation that loads the DOM in memory to perform link extraction, consider using DOMLinkExtractor.
Applicable documents
By default, this extractor is applied only to documents matching one of these content types:
You can specify your own content types or other restrictions with AbstractLinkExtractor.setRestrictions(List). Make sure they represent a file with HTML-like markup tags containing URLs. For documents that are too different, consider implementing your own ILinkExtractor instead. Removing the default values and defining no content types causes the extractor to attempt URL extraction from all files (usually a bad idea).
Tags attributes
URLs are assumed to be contained within valid tags or tag attributes. The default tags and attributes used are (tag.attribute):
a.href, frame.src, iframe.src, img.src, meta.http-equiv
You can specify your own set of tags and attributes to have different ones used for extracting URLs. For a more elaborate set, you can combine the above with your own list or use any of the following suggestions (tag.attribute):
applet.archive, applet.codebase, area.href, audio.src, base.href, blockquote.cite, body.background, button.formaction, command.icon, del.cite, embed.src, form.action, frame.longdesc, head.profile, html.manifest, iframe.longdesc, img.longdesc, img.usemap, input.formaction, input.src, input.usemap, ins.cite, link.href, object.archive, object.classid, object.codebase, object.data, object.usemap, q.cite, script.src, source.src, video.poster, video.src
The meta.http-equiv entry is treated differently: a URL is extracted only if the "http-equiv" value is "refresh" and a "content" attribute with a URL exists. "object" and "applet" can have multiple URLs.
Since 2.2.0, it is possible to identify a tag only as the holder of a URL (without attributes). The tag body value will be used as the URL.
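The meta refresh rule above can be illustrated with a small, JDK-only sketch. It is not the code used by HtmlLinkExtractor; the class name MetaRefreshSketch, the method refreshUrl, and the regular expression are assumptions made for illustration. It returns a URL only when the "http-equiv" value is "refresh" and the "content" value carries a url= part.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: hypothetical names, not the HtmlLinkExtractor implementation.
final class MetaRefreshSketch {

    // Matches the URL portion of a refresh "content" value such as "5; url=page.html".
    private static final Pattern REFRESH_URL = Pattern.compile(
            "^\\s*\\d*\\s*[;,]?\\s*url\\s*=\\s*['\"]?([^'\"\\s]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the refresh target, or empty when the tag is not a refresh with a URL.
    static Optional<String> refreshUrl(String httpEquiv, String content) {
        if (httpEquiv == null || content == null
                || !"refresh".equalsIgnoreCase(httpEquiv.trim())) {
            return Optional.empty();
        }
        Matcher m = REFRESH_URL.matcher(content);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(refreshUrl("refresh", "5; url=https://example.com/next.html"));
        System.out.println(refreshUrl("content-type", "text/html; charset=UTF-8"));
    }
}
```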
Referrer data
Some "referrer" information is derived from each link and stored as metadata in the document it points to. This information may vary for each link, but it is normally prefixed with HttpDocMetadata.REFERRER_LINK_PREFIX.
Since 2.6.0, the referrer data is always stored (was optional before).
Character encoding
Since 2.4.0, this extractor will by default attempt to detect the encoding of a page when extracting links and referrer information. If no charset could be detected, it falls back to UTF-8. It is also possible to dictate which encoding to use with setCharset(String).
"nofollow"
By default, a regular HTML link having the "rel" attribute set to "nofollow" won't be extracted (e.g. <a href="x.html" rel="nofollow" ...>). To force its extraction (and ensure it is followed), you can set setIgnoreNofollow(boolean) to true.
URL Fragments
Since 2.3.0, this extractor preserves hashtag characters (#) found in URLs and every character after them. It relies on the implementation of IURLNormalizer to strip them if need be. GenericURLNormalizer is now always invoked by default, and the default set of rules defined for it will remove fragments.
The URL specification says hashtags are used to represent fragments only, that is, to quickly jump to a specific section of the page the URL represents. Under normal circumstances, keeping the URL fragments leads to duplicate documents being fetched (same URL but different fragment), so they should be stripped. Unfortunately, some sites do not follow the URL standard and use hashtags as a regular part of a URL (i.e., different hashtags point to different web pages). It may be essential when crawling these sites to keep the URL fragments. This can be done by making sure the URL normalizer does not strip them.
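To make the duplicate problem concrete, here is a minimal JDK-only sketch of fragment stripping. It does not use IURLNormalizer or GenericURLNormalizer; the class FragmentStripSketch and its methods are hypothetical names, and the sketch simply treats everything from the first "#" onward as the fragment.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: hypothetical names, not the Norconex URL normalizer.
final class FragmentStripSketch {

    // Drops the first '#' and everything after it, if present.
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // URLs differing only by fragment collapse to a single entry.
    static Set<String> dedupe(List<String> urls) {
        Set<String> unique = new LinkedHashSet<>();
        for (String url : urls) {
            unique.add(stripFragment(url));
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "https://example.com/page.html#intro",
                "https://example.com/page.html#details",
                "https://example.com/other.html");
        System.out.println(dedupe(urls));
    }
}
```

For sites that use "#" as a meaningful part of the URL, skipping this kind of stripping keeps the three URLs above distinct.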
Ignoring link data
By default, contextual information is kept about the HTML/XML mark-up tag from which a link is extracted (e.g., tag name and attributes). That information gets stored as metadata in the target document. If you want to limit the quantity of information extracted/stored, you can disable this feature by setting ignoreLinkData to true.
URL Schemes
Since 2.4.0, only valid schemes are extracted for absolute URLs. By default, those are http, https, and ftp. You can specify your own list of supported protocols with setSchemes(String[]).
HTML/XML Comments
Since 2.6.0, URLs found in <!-- comments --> are no longer extracted by default. To enable URL extraction from comments, use setCommentsEnabled(boolean).
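One way to picture the comment setting is as a pre-processing step that removes comments before links are searched. The sketch below assumes exactly that; the class CommentStripSketch and the method prepare are hypothetical, and the real extractor (which works on chunks of text) may handle comments differently.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: hypothetical names, not the HtmlLinkExtractor implementation.
final class CommentStripSketch {

    // DOTALL lets a comment span several lines; ".*?" stops at the first "-->".
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Returns the text on which link extraction would then be performed.
    static String prepare(String html, boolean commentsEnabled) {
        return commentsEnabled ? html : COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">A</a><!-- <a href=\"b.html\">B</a> -->";
        System.out.println(prepare(html, false));
        System.out.println(prepare(html, true));
    }
}
```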
Extract links in certain parts only
Since 2.8.0, you can identify portions of a document where links should be extracted or ignored with setExtractBetweens(RegexPair...) and setNoExtractBetweens(RegexPair...). Eligible content for link extraction is identified first, and content to exclude is then removed from that subset.
Since 2.9.0, you can further limit link extraction to specific areas by using selector syntax, with setExtractSelectors(String...) and setNoExtractSelectors(String...).
XML configuration usage:
<extractor class="com.norconex.collector.http.link.impl.HtmlLinkExtractor"
    maxURLLength="(maximum URL length. Default is 2048)"
    ignoreNofollow="[false|true]"
    ignoreLinkData="[false|true]"
    commentsEnabled="[false|true]"
    charset="(supported character encoding)">
  <fieldMatcher>
    (optional expression for fields used for links extraction instead
     of the document stream)
  </fieldMatcher>
  <schemes>
    (CSV list of URI scheme for which to perform link extraction.
     leave blank or remove tag to use defaults.)
  </schemes>
  <!-- Which tags and attributes hold the URLs to extract. -->
  <tags>
    <tag name="(tag name)" attribute="(tag attribute)"/>
    <!-- you can have multiple tag entries -->
  </tags>
  <!-- Only extract URLs from the following text portions. -->
  <extractBetween caseSensitive="[false|true]">
    <start>(regex)</start>
    <end>(regex)</end>
  </extractBetween>
  <!-- you can have multiple extractBetween entries -->
  <!-- Do not extract URLs from the following text portions. -->
  <noExtractBetween caseSensitive="[false|true]">
    <start>(regex)</start>
    <end>(regex)</end>
  </noExtractBetween>
  <!-- you can have multiple noExtractBetween entries -->
  <!-- Only extract URLs matching the following selectors. -->
  <extractSelector>(selector)</extractSelector>
  <!-- you can have multiple extractSelector entries -->
  <!-- Do not extract URLs matching the following selectors. -->
  <noExtractSelector>(selector)</noExtractSelector>
  <!-- you can have multiple noExtractSelector entries -->
</extractor>
XML usage example:
<extractor class="com.norconex.collector.http.link.impl.HtmlLinkExtractor">
  <tags>
    <tag name="a" attribute="href"/>
    <tag name="frame" attribute="src"/>
    <tag name="iframe" attribute="src"/>
    <tag name="img" attribute="src"/>
    <tag name="meta" attribute="http-equiv"/>
    <tag name="script" attribute="src"/>
  </tags>
</extractor>
The above example adds URLs to JavaScript files to the list of URLs to be extracted.
- Since:
- 3.0.0 (refactored from GenericLinkExtractor)
- Author:
- Pascal Essiembre
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description):
static class HtmlLinkExtractor.RegexPair
-
Field Summary
Fields (Modifier and Type, Field, Description):
static int DEFAULT_MAX_URL_LENGTH - Default maximum length a URL can have.
static int MAX_BUFFER_SIZE
static int OVERLAP_SIZE
-
Constructor Summary
Constructors (Constructor, Description):
HtmlLinkExtractor()
-
Method Summary
Methods (Modifier and Type, Method, Description):
void addExtractBetween(String start, String end, boolean caseSensitive) - Adds patterns delimiting a portion of a document to be considered for link extraction.
void addExtractSelectors(String... selectors) - Adds selectors matching the portions of a document to be considered for link extraction.
void addExtractSelectors(List<String> selectors) - Adds selectors matching the portions of a document to be considered for link extraction.
void addLinkTag(String tagName, String attribute)
void addNoExtractBetween(String start, String end, boolean caseSensitive) - Adds patterns delimiting a portion of a document to be excluded from link extraction.
void addNoExtractSelectors(String... selectors) - Adds selectors matching the portions of a document to be excluded from link extraction.
void addNoExtractSelectors(List<String> selectors) - Adds selectors matching the portions of a document to be excluded from link extraction.
void clearLinkTags()
boolean equals(Object other)
void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader)
String getCharset() - Gets the character set of pages on which link extraction is performed.
List<HtmlLinkExtractor.RegexPair> getExtractBetweens() - Gets the patterns delimiting the portions of a document to be considered for link extraction.
List<String> getExtractSelectors() - Gets the selectors matching the portions of a document to be considered for link extraction.
int getMaxURLLength() - Gets the maximum supported URL length.
List<HtmlLinkExtractor.RegexPair> getNoExtractBetweens() - Gets the patterns delimiting the portions of a document to be excluded from link extraction.
List<String> getNoExtractSelectors() - Gets the selectors matching the portions of a document to be excluded from link extraction.
List<String> getSchemes() - Gets the schemes to be extracted.
int hashCode()
boolean isCommentsEnabled() - Gets whether links should be extracted from HTML/XML comments.
boolean isIgnoreLinkData() - Gets whether to ignore extra data associated with a link.
boolean isIgnoreNofollow()
protected void loadTextLinkExtractorFromXML(XML xml) - Loads configuration settings specific to the implementing class.
void removeLinkTag(String tagName, String attribute)
protected void saveTextLinkExtractorToXML(XML xml) - Saves configuration settings specific to the implementing class.
void setCharset(String charset) - Sets the character set of pages on which link extraction is performed.
void setCommentsEnabled(boolean commentsEnabled) - Sets whether links should be extracted from HTML/XML comments.
void setExtractBetweens(HtmlLinkExtractor.RegexPair... betweens) - Sets the patterns delimiting the portions of a document to be considered for link extraction.
void setExtractBetweens(List<HtmlLinkExtractor.RegexPair> betweens) - Sets the patterns delimiting the portions of a document to be considered for link extraction.
void setExtractSelectors(String... selectors) - Sets the selectors matching the portions of a document to be considered for link extraction.
void setExtractSelectors(List<String> selectors) - Sets the selectors matching the portions of a document to be considered for link extraction.
void setIgnoreLinkData(boolean ignoreLinkData) - Sets whether to ignore extra data associated with a link.
void setIgnoreNofollow(boolean ignoreNofollow)
void setMaxURLLength(int maxURLLength) - Sets the maximum supported URL length.
void setNoExtractBetweens(HtmlLinkExtractor.RegexPair... betweens) - Sets the patterns delimiting the portions of a document to be excluded from link extraction.
void setNoExtractBetweens(List<HtmlLinkExtractor.RegexPair> betweens) - Sets the patterns delimiting the portions of a document to be excluded from link extraction.
void setNoExtractSelectors(String... selectors) - Sets the selectors matching the portions of a document to be excluded from link extraction.
void setNoExtractSelectors(List<String> selectors) - Sets the selectors matching the portions of a document to be excluded from link extraction.
void setSchemes(String... schemes) - Sets the schemes to be extracted.
void setSchemes(List<String> schemes) - Sets the schemes to be extracted.
String toString()
-
Methods inherited from class com.norconex.collector.http.link.AbstractTextLinkExtractor
extractLinks, getFieldMatcher, loadLinkExtractorFromXML, saveLinkExtractorToXML, setFieldMatcher
-
Methods inherited from class com.norconex.collector.http.link.AbstractLinkExtractor
addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
-
-
-
-
Field Detail
-
MAX_BUFFER_SIZE
public static final int MAX_BUFFER_SIZE
- See Also:
- Constant Field Values
-
OVERLAP_SIZE
public static final int OVERLAP_SIZE
- See Also:
- Constant Field Values
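The class description says text is processed one chunk at a time, and the MAX_BUFFER_SIZE and OVERLAP_SIZE constants suggest a buffer whose tail is carried over to the next read. The following JDK-only sketch shows that general technique under those assumptions. It is not the class's actual code: the class ChunkedExtractionSketch, the method extract, the href-only pattern, and the parameter values are all hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: hypothetical names, not the HtmlLinkExtractor implementation.
final class ChunkedExtractionSketch {

    // Simplified pattern: double-quoted href values only.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static Set<String> extract(Reader reader, int bufferSize, int overlapSize) {
        Set<String> links = new LinkedHashSet<>(); // a set absorbs matches seen twice
        char[] buffer = new char[bufferSize];
        String carry = "";
        try {
            int read;
            while ((read = reader.read(buffer)) != -1) {
                // Prepend the tail of the previous chunk so that an attribute
                // cut at a chunk boundary can still be matched as a whole.
                String chunk = carry + new String(buffer, 0, read);
                Matcher m = HREF.matcher(chunk);
                while (m.find()) {
                    links.add(m.group(1));
                }
                carry = chunk.length() > overlapSize
                        ? chunk.substring(chunk.length() - overlapSize)
                        : chunk;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"one.html\">1</a> <a href=\"two.html\">2</a>";
        System.out.println(extract(new StringReader(html), 16, 24));
    }
}
```

In this sketch, only the current chunk and the overlap are held in memory. A match longer than the overlap that straddles a boundary would be missed, which is one reason a maximum URL length is a natural companion setting.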
-
DEFAULT_MAX_URL_LENGTH
public static final int DEFAULT_MAX_URL_LENGTH
Default maximum length a URL can have.
- See Also:
- Constant Field Values
-
-
Method Detail
-
extractTextLinks
public void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) throws IOException
- Specified by:
extractTextLinks in class AbstractTextLinkExtractor
- Throws:
IOException
-
getMaxURLLength
public int getMaxURLLength()
Gets the maximum supported URL length.
- Returns:
- maximum URL length
-
setMaxURLLength
public void setMaxURLLength(int maxURLLength)
Sets the maximum supported URL length.
- Parameters:
maxURLLength - maximum URL length
-
getExtractBetweens
public List<HtmlLinkExtractor.RegexPair> getExtractBetweens()
Gets the patterns delimiting the portions of a document to be considered for link extraction.
- Returns:
- extract between patterns
- Since:
- 2.8.0
-
setExtractBetweens
public void setExtractBetweens(HtmlLinkExtractor.RegexPair... betweens)
Sets the patterns delimiting the portions of a document to be considered for link extraction.
- Parameters:
betweens - extract between patterns
- Since:
- 2.8.0
-
setExtractBetweens
public void setExtractBetweens(List<HtmlLinkExtractor.RegexPair> betweens)
Sets the patterns delimiting the portions of a document to be considered for link extraction.
- Parameters:
betweens - extract between patterns
- Since:
- 3.0.0
-
addExtractBetween
public void addExtractBetween(String start, String end, boolean caseSensitive)
Adds patterns delimiting a portion of a document to be considered for link extraction.
- Parameters:
start - pattern matching start of text portion
end - pattern matching end of text portion
caseSensitive - whether the patterns are case sensitive or not
- Since:
- 2.8.0
-
getNoExtractBetweens
public List<HtmlLinkExtractor.RegexPair> getNoExtractBetweens()
Gets the patterns delimiting the portions of a document to be excluded from link extraction.
- Returns:
- extract between patterns
- Since:
- 2.8.0
-
setNoExtractBetweens
public void setNoExtractBetweens(HtmlLinkExtractor.RegexPair... betweens)
Sets the patterns delimiting the portions of a document to be excluded from link extraction.
- Parameters:
betweens - extract between patterns
- Since:
- 2.8.0
-
setNoExtractBetweens
public void setNoExtractBetweens(List<HtmlLinkExtractor.RegexPair> betweens)
Sets the patterns delimiting the portions of a document to be excluded from link extraction.
- Parameters:
betweens - extract between patterns
- Since:
- 3.0.0
-
addNoExtractBetween
public void addNoExtractBetween(String start, String end, boolean caseSensitive)
Adds patterns delimiting a portion of a document to be excluded from link extraction.
- Parameters:
start - pattern matching start of text portion
end - pattern matching end of text portion
caseSensitive - whether the patterns are case sensitive or not
- Since:
- 2.8.0
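The class description states that eligible content is identified first and exclusions are then applied to that subset. The sketch below illustrates that order with plain java.util.regex. It is an approximation under stated assumptions: the class BetweenSketch and its methods are hypothetical, and whether the real extractor keeps or drops the start and end delimiters themselves is not specified here (this sketch keeps them in the eligible text and removes them with excluded text).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: hypothetical names, not the HtmlLinkExtractor implementation.
final class BetweenSketch {

    // Builds one pattern spanning from a start regex to the nearest end regex.
    static Pattern between(String start, String end, boolean caseSensitive) {
        int flags = Pattern.DOTALL | (caseSensitive ? 0 : Pattern.CASE_INSENSITIVE);
        return Pattern.compile("(?:" + start + ").*?(?:" + end + ")", flags);
    }

    // Step 1: keep only the "extract between" portions.
    // Step 2: remove the "no extract between" portions from what was kept.
    static String eligibleText(String text, Pattern extract, Pattern noExtract) {
        StringBuilder kept = new StringBuilder();
        Matcher m = extract.matcher(text);
        while (m.find()) {
            kept.append(m.group()).append('\n');
        }
        return noExtract.matcher(kept).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<nav><a href=\"n.html\">N</a></nav>"
                + "<main><a href=\"m.html\">M</a>"
                + "<aside><a href=\"s.html\">S</a></aside></main>";
        System.out.println(eligibleText(html,
                between("<main>", "</main>", false),
                between("<aside>", "</aside>", false)));
    }
}
```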
-
getExtractSelectors
public List<String> getExtractSelectors()
Gets the selectors matching the portions of a document to be considered for link extraction.
- Returns:
- selectors
- Since:
- 2.9.0
-
setExtractSelectors
public void setExtractSelectors(String... selectors)
Sets the selectors matching the portions of a document to be considered for link extraction.
- Parameters:
selectors - selectors
- Since:
- 2.9.0
-
setExtractSelectors
public void setExtractSelectors(List<String> selectors)
Sets the selectors matching the portions of a document to be considered for link extraction.
- Parameters:
selectors - selectors
- Since:
- 3.0.0
-
addExtractSelectors
public void addExtractSelectors(String... selectors)
Adds selectors matching the portions of a document to be considered for link extraction.
- Parameters:
selectors - selectors
- Since:
- 2.9.0
-
addExtractSelectors
public void addExtractSelectors(List<String> selectors)
Adds selectors matching the portions of a document to be considered for link extraction.
- Parameters:
selectors - selectors
- Since:
- 3.0.0
-
getNoExtractSelectors
public List<String> getNoExtractSelectors()
Gets the selectors matching the portions of a document to be excluded from link extraction.
- Returns:
- selectors
- Since:
- 2.9.0
-
setNoExtractSelectors
public void setNoExtractSelectors(String... selectors)
Sets the selectors matching the portions of a document to be excluded from link extraction.
- Parameters:
selectors - selectors
- Since:
- 2.9.0
-
setNoExtractSelectors
public void setNoExtractSelectors(List<String> selectors)
Sets the selectors matching the portions of a document to be excluded from link extraction.
- Parameters:
selectors - selectors
- Since:
- 3.0.0
-
addNoExtractSelectors
public void addNoExtractSelectors(String... selectors)
Adds selectors matching the portions of a document to be excluded from link extraction.
- Parameters:
selectors - selectors
- Since:
- 2.9.0
-
addNoExtractSelectors
public void addNoExtractSelectors(List<String> selectors)
Adds selectors matching the portions of a document to be excluded from link extraction.
- Parameters:
selectors - selectors
- Since:
- 3.0.0
-
isCommentsEnabled
public boolean isCommentsEnabled()
Gets whether links should be extracted from HTML/XML comments.
- Returns:
- true if links should be extracted from comments.
- Since:
- 2.6.0
-
setCommentsEnabled
public void setCommentsEnabled(boolean commentsEnabled)
Sets whether links should be extracted from HTML/XML comments.
- Parameters:
commentsEnabled - true if links should be extracted from comments.
- Since:
- 2.6.0
-
getSchemes
public List<String> getSchemes()
Gets the schemes to be extracted.
- Returns:
- schemes to be extracted
- Since:
- 2.4.0
-
setSchemes
public void setSchemes(String... schemes)
Sets the schemes to be extracted.
- Parameters:
schemes - schemes to be extracted
- Since:
- 2.4.0
-
setSchemes
public void setSchemes(List<String> schemes)
Sets the schemes to be extracted.
- Parameters:
schemes - schemes to be extracted
- Since:
- 3.0.0
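As an illustration of scheme filtering, the sketch below accepts an absolute URL only when its scheme is in the configured list, and lets relative URLs through because they carry no scheme. The defaults (http, https, ftp) come from the class description; the class SchemeFilterSketch, the method isAccepted, and the handling of relative URLs are assumptions, not the library's code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: hypothetical names, not the HtmlLinkExtractor implementation.
final class SchemeFilterSketch {

    static final List<String> DEFAULT_SCHEMES = Arrays.asList("http", "https", "ftp");

    // RFC 3986 scheme: a letter followed by letters, digits, "+", "-" or ".".
    private static final Pattern SCHEME = Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*):");

    static boolean isAccepted(String url, List<String> schemes) {
        Matcher m = SCHEME.matcher(url);
        if (!m.find()) {
            return true; // relative URL: no scheme to validate
        }
        return schemes.contains(m.group(1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isAccepted("https://example.com/", DEFAULT_SCHEMES));
        System.out.println(isAccepted("mailto:someone@example.com", DEFAULT_SCHEMES));
        System.out.println(isAccepted("docs/page.html", DEFAULT_SCHEMES));
    }
}
```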
-
isIgnoreNofollow
public boolean isIgnoreNofollow()
-
setIgnoreNofollow
public void setIgnoreNofollow(boolean ignoreNofollow)
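This method has no description of its own; per the class description, links whose "rel" attribute is set to "nofollow" are skipped unless ignoreNofollow is true. The JDK-only sketch below illustrates that decision. The class NofollowSketch, the method shouldExtract, and the regular expression are hypothetical and only approximate how a tag might be inspected.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: hypothetical names, not the HtmlLinkExtractor implementation.
final class NofollowSketch {

    // Finds "nofollow" as one of the tokens of a rel attribute value.
    private static final Pattern NOFOLLOW = Pattern.compile(
            "\\brel\\s*=\\s*[\"']?[^\"'>]*\\bnofollow\\b", Pattern.CASE_INSENSITIVE);

    // A link is extracted when nofollow is ignored or simply not present.
    static boolean shouldExtract(String tag, boolean ignoreNofollow) {
        return ignoreNofollow || !NOFOLLOW.matcher(tag).find();
    }

    public static void main(String[] args) {
        String tag = "<a href=\"x.html\" rel=\"nofollow\">";
        System.out.println(shouldExtract(tag, false));
        System.out.println(shouldExtract(tag, true));
    }
}
```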
-
isIgnoreLinkData
public boolean isIgnoreLinkData()
Gets whether to ignore extra data associated with a link.
- Returns:
- true to ignore.
- Since:
- 3.0.0
-
setIgnoreLinkData
public void setIgnoreLinkData(boolean ignoreLinkData)
Sets whether to ignore extra data associated with a link.
- Parameters:
ignoreLinkData - true to ignore.
- Since:
- 3.0.0
-
getCharset
public String getCharset()
Gets the character set of pages on which link extraction is performed. Default is null (charset detection will be attempted).
- Returns:
- character set to use, or null
- Since:
- 2.4.0
-
setCharset
public void setCharset(String charset)
Sets the character set of pages on which link extraction is performed. Not specifying any (null) will attempt charset detection.
- Parameters:
charset - character set to use, or null
- Since:
- 2.4.0
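The resolution order described for charsets (an explicitly set charset, otherwise a detected one, otherwise UTF-8) can be sketched as follows. Detection itself is outside the scope of this sketch and is represented by a plain string argument; the class CharsetFallbackSketch, the method resolve, and the choice to skip unsupported names rather than fail are assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: hypothetical names, not the HtmlLinkExtractor implementation.
final class CharsetFallbackSketch {

    // configured: value passed to a setter, or null; detected: detection result, or null.
    static Charset resolve(String configured, String detected) {
        for (String name : new String[] {configured, detected}) {
            if (name != null && !name.trim().isEmpty()) {
                try {
                    return Charset.forName(name.trim());
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: try the next candidate.
                }
            }
        }
        return StandardCharsets.UTF_8; // documented fallback
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, null));
        System.out.println(resolve("ISO-8859-1", "UTF-16"));
        System.out.println(resolve(null, "UTF-16"));
    }
}
```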
-
clearLinkTags
public void clearLinkTags()
-
loadTextLinkExtractorFromXML
protected void loadTextLinkExtractorFromXML(XML xml)
Description copied from class: AbstractTextLinkExtractor
Loads configuration settings specific to the implementing class.
- Specified by:
loadTextLinkExtractorFromXML in class AbstractTextLinkExtractor
- Parameters:
xml - XML configuration
-
saveTextLinkExtractorToXML
protected void saveTextLinkExtractorToXML(XML xml)
Description copied from class: AbstractTextLinkExtractor
Saves configuration settings specific to the implementing class.
- Specified by:
saveTextLinkExtractorToXML in class AbstractTextLinkExtractor
- Parameters:
xml - the XML
-
equals
public boolean equals(Object other)
- Overrides:
equals in class AbstractTextLinkExtractor
-
hashCode
public int hashCode()
- Overrides:
hashCode in class AbstractTextLinkExtractor
-
toString
public String toString()
- Overrides:
toString in class AbstractTextLinkExtractor
-
-