Class DOMLinkExtractor
- java.lang.Object
  - com.norconex.collector.http.link.AbstractLinkExtractor
    - com.norconex.collector.http.link.AbstractTextLinkExtractor
      - com.norconex.collector.http.link.impl.DOMLinkExtractor
All Implemented Interfaces:
ILinkExtractor, IXMLConfigurable
public class DOMLinkExtractor extends AbstractTextLinkExtractor
Extracts links from a Document Object Model (DOM) representation of HTML, XHTML, or XML document content, based on values of matching elements and attributes.
In order to construct a DOM tree, text is loaded entirely into memory. The document content is used by default, but text can also come from specified metadata fields. Use this extractor with caution if you know you'll need to parse huge files. Use the HtmlLinkExtractor instead if this is a concern.
The jsoup parser library is used to load document content into a DOM tree. Elements are referenced using a CSS or jQuery-like syntax.
This link extractor is normally used before importing.
When used before importing, this class attempts to detect the content character encoding unless the character encoding was specified using setCharset(String). Since document parsing converts content to UTF-8, UTF-8 is always assumed when used as a post-parse handler.
You can specify which parser to use when reading documents. The default is "html" and will normalize the content as HTML. This is generally the desired behavior, but it can sometimes cause your selectors to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt normalization on the content. The drawback with "xml" is that not all HTML-specific selector options may work. If you know you are dealing with XML to begin with, specifying "xml" should be a good option.
Matching links
You can define as many JSoup "selectors" as desired. All values matched by a selector will be extracted as a URL.
It is possible to control what gets extracted exactly for matching purposes thanks to the "extract" argument expected with every selector. Possible values are:
When not specified, the default is "text".
The default selectors / extract strategies are:
- a[href] / attr(href)
- [src] / attr(src)
- link[href] / attr(href)
- meta[http-equiv='refresh'] / attr(content)
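The meta[http-equiv='refresh'] pair reads the raw content attribute, which typically holds a delay followed by the target (for example "5; url=http://example.com/next.html") and so is not a URL on its own. Below is a minimal sketch of the kind of cleanup this page describes for such values: keep what follows url= and trim white space. It uses only the JDK. The class name, method name, and regular expression are hypothetical, and this is not the extractor's actual implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaRefreshCleanup {

    // Optional leading delay (digits, dots, spaces), optional ";" or ",",
    // then the "url=" marker in any letter case.
    private static final Pattern REFRESH_PREFIX =
            Pattern.compile("[\\d.\\s]*[;,]?\\s*url\\s*=\\s*", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: keeps only what follows "url=" and trims white space.
    static String clean(String rawValue) {
        if (rawValue == null) {
            return "";
        }
        String value = rawValue.trim();
        Matcher m = REFRESH_PREFIX.matcher(value);
        // lookingAt() only matches at the start, so a "url=" query
        // parameter inside a regular URL is left alone.
        if (m.lookingAt()) {
            value = value.substring(m.end()).trim();
        }
        return value;
    }

    public static void main(String[] args) {
        // prints http://example.com/next.html
        System.out.println(clean(" 5; URL=http://example.com/next.html "));
        // prints http://example.com/?url=a
        System.out.println(clean("  http://example.com/?url=a  "));
    }
}
```

Anchoring the match at the start of the value is a design choice of this sketch, so that values extracted from other selectors, such as a[href], are only trimmed.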
For any extracted link value, this extractor performs minimal heuristics to clean extra content that is not part of a regular URL. For instance, it only keeps what comes after url= when dealing with <meta http-equiv refresh URLs. It also trims white space.
Ignoring link data
By default, contextual information is kept about the HTML/XML mark-up tag from which a link is extracted (e.g., tag name and attributes). That information gets stored as metadata in the target document. If you want to limit the quantity of information extracted/stored, you can disable this feature by setting ignoreLinkData to true.
URL Schemes
Only valid schemes are extracted for absolute URLs. By default, those are http, https, and ftp. You can specify your own list of supported protocols with setSchemes(String[]).
Applicable documents
By default, this extractor is only applied to documents matching one of these content types:
"nofollow"
By default, a regular HTML link having the "rel" attribute set to "nofollow" won't be extracted (e.g., <a href="x.html" rel="nofollow" ...>). To force its extraction (and ensure it is followed), you can call setIgnoreNofollow(boolean) with true.
XML configuration usage:
<extractor class="com.norconex.collector.http.link.impl.DOMLinkExtractor"
    ignoreNofollow="[false|true]"
    ignoreLinkData="[false|true]"
    parser="[html|xml]"
    charset="(supported character encoding)">
  <fieldMatcher>
    (optional expression for fields used for links extraction instead
    of the document stream)
  </fieldMatcher>
  <schemes>
    (CSV list of URI scheme for which to perform link extraction.
    leave blank or remove tag to use defaults.)
  </schemes>
  <!-- Repeat as needed: -->
  <linkSelector>(selector syntax)</linkSelector>
  <!-- Optional. Only apply link selectors to portions of a document
       matching these selectors. Repeat as needed. -->
  <extractSelector>(selector syntax)</extractSelector>
  <!-- Optional. Do not apply link selectors to portions of a document
       matching these selectors. Repeat as needed. -->
  <noExtractSelector>(selector syntax)</noExtractSelector>
</extractor>
XML usage example:
<extractor class="com.norconex.collector.http.link.impl.DOMLinkExtractor">
  <linkSelector extract="attr(href)">a[href]</linkSelector>
  <linkSelector extract="attr(src)">[src]</linkSelector>
  <linkSelector extract="attr(href)">link[href]</linkSelector>
  <linkSelector extract="attr(content)">meta[http-equiv='refresh']</linkSelector>
  <linkSelector extract="attr(data-myurl)">[data-myurl]</linkSelector>
</extractor>
The above example will extract URLs found in custom element attributes named data-myurl.
- Since:
- 3.0.0
- Author:
- Pascal Essiembre
Constructor Summary
Constructor | Description
DOMLinkExtractor() |
Method Summary
Modifier and Type | Method | Description
void | addExtractSelectors(String... selectors) |
void | addExtractSelectors(List<String> selectors) |
void | addLinkSelector(String selector) | Adds a new link selector extracting the "text" from matches.
void | addLinkSelector(String selector, String extract) |
void | addNoExtractSelectors(String... selectors) |
void | addNoExtractSelectors(List<String> selectors) |
void | clearLinkSelectors() |
boolean | equals(Object other) |
void | extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) |
String | getCharset() | Gets the assumed source character encoding.
List<String> | getExtractSelectors() |
List<String> | getNoExtractSelectors() |
String | getParser() | Gets the parser to use when creating the DOM-tree.
List<String> | getSchemes() | Gets the schemes to be extracted.
int | hashCode() |
boolean | isIgnoreLinkData() | Gets whether to ignore extra data associated with a link.
boolean | isIgnoreNofollow() |
protected void | loadTextLinkExtractorFromXML(XML xml) | Loads configuration settings specific to the implementing class.
void | removeLinkSelector(String selector) |
protected void | saveTextLinkExtractorToXML(XML xml) | Saves configuration settings specific to the implementing class.
void | setCharset(String charset) | Sets the assumed source character encoding.
void | setExtractSelectors(String... selectors) |
void | setExtractSelectors(List<String> selectors) |
void | setIgnoreLinkData(boolean ignoreLinkData) | Sets whether to ignore extra data associated with a link.
void | setIgnoreNofollow(boolean ignoreNofollow) |
void | setNoExtractSelectors(String... selectors) |
void | setNoExtractSelectors(List<String> selectors) |
void | setParser(String parser) | Sets the parser to use when creating the DOM-tree.
void | setSchemes(String... schemes) | Sets the schemes to be extracted.
void | setSchemes(List<String> schemes) | Sets the schemes to be extracted.
String | toString() |
Methods inherited from class com.norconex.collector.http.link.AbstractTextLinkExtractor
extractLinks, getFieldMatcher, loadLinkExtractorFromXML, saveLinkExtractorToXML, setFieldMatcher
-
Methods inherited from class com.norconex.collector.http.link.AbstractLinkExtractor
addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
Method Detail
-
getCharset
public String getCharset()
Gets the assumed source character encoding.
- Returns:
- character encoding of the source to be transformed
-
setCharset
public void setCharset(String charset)
Sets the assumed source character encoding.
- Parameters:
- charset - character encoding of the source to be transformed
-
getParser
public String getParser()
Gets the parser to use when creating the DOM-tree.
- Returns:
- html (default) or xml.
-
setParser
public void setParser(String parser)
Sets the parser to use when creating the DOM-tree.
- Parameters:
- parser - html or xml.
-
isIgnoreNofollow
public boolean isIgnoreNofollow()
-
setIgnoreNofollow
public void setIgnoreNofollow(boolean ignoreNofollow)
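The class description states that links whose "rel" attribute is set to "nofollow" are skipped by default and that setting this flag to true forces their extraction. A minimal sketch of that decision, using only the JDK, could look like the following. The class and method names are hypothetical, and treating "rel" as a list of space-separated, case-insensitive tokens is an assumption of this sketch, not a documented detail of the extractor.

```java
import java.util.Arrays;

final class NofollowCheck {

    // True when the rel attribute value contains the "nofollow" token.
    // Assumption: rel holds space-separated, case-insensitive tokens.
    static boolean hasNofollow(String rel) {
        if (rel == null || rel.trim().isEmpty()) {
            return false;
        }
        return Arrays.stream(rel.trim().split("\\s+"))
                .anyMatch(token -> token.equalsIgnoreCase("nofollow"));
    }

    // Hypothetical decision: extract the link unless it is marked
    // "nofollow" and the ignoreNofollow flag is left at false.
    static boolean shouldExtract(String rel, boolean ignoreNofollow) {
        return ignoreNofollow || !hasNofollow(rel);
    }

    public static void main(String[] args) {
        System.out.println(shouldExtract("nofollow", false));          // prints false
        System.out.println(shouldExtract("nofollow", true));           // prints true
        System.out.println(shouldExtract("noopener NoFollow", false)); // prints false
        System.out.println(shouldExtract(null, false));                // prints true
    }
}
```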
-
isIgnoreLinkData
public boolean isIgnoreLinkData()
Gets whether to ignore extra data associated with a link.
- Returns:
- true to ignore.
-
setIgnoreLinkData
public void setIgnoreLinkData(boolean ignoreLinkData)
Sets whether to ignore extra data associated with a link.
- Parameters:
- ignoreLinkData - true to ignore.
-
addLinkSelector
public void addLinkSelector(String selector)
Adds a new link selector extracting the "text" from matches.
- Parameters:
- selector - JSoup selector
-
removeLinkSelector
public void removeLinkSelector(String selector)
-
clearLinkSelectors
public void clearLinkSelectors()
-
setExtractSelectors
public void setExtractSelectors(String... selectors)
-
addExtractSelectors
public void addExtractSelectors(String... selectors)
-
setNoExtractSelectors
public void setNoExtractSelectors(String... selectors)
-
addNoExtractSelectors
public void addNoExtractSelectors(String... selectors)
-
getSchemes
public List<String> getSchemes()
Gets the schemes to be extracted.
- Returns:
- schemes to be extracted
-
setSchemes
public void setSchemes(String... schemes)
Sets the schemes to be extracted.
- Parameters:
- schemes - schemes to be extracted
-
setSchemes
public void setSchemes(List<String> schemes)
Sets the schemes to be extracted.
- Parameters:
- schemes - schemes to be extracted
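To illustrate what restricting extraction to a list of schemes amounts to, here is a minimal sketch using only the JDK. The SchemeFilter class and its accepts method are hypothetical and not part of this library. Letting URLs without a scheme pass through is an assumption, based on the class description saying the restriction applies to absolute URLs.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SchemeFilter {

    // Leading "scheme:" of an absolute URL (RFC 3986 scheme syntax).
    private static final Pattern SCHEME =
            Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*):");

    private final Set<String> schemes = new HashSet<>();

    SchemeFilter(String... schemes) {
        for (String scheme : schemes) {
            this.schemes.add(scheme.toLowerCase(Locale.ROOT));
        }
    }

    // Hypothetical check: absolute URLs must use a supported scheme.
    boolean accepts(String url) {
        Matcher m = SCHEME.matcher(url.trim());
        if (!m.find()) {
            // Assumption: URLs without a scheme are relative and not filtered here.
            return true;
        }
        return schemes.contains(m.group(1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // The defaults named in the class description.
        SchemeFilter filter = new SchemeFilter("http", "https", "ftp");
        System.out.println(filter.accepts("HTTPS://example.com/"));       // prints true
        System.out.println(filter.accepts("mailto:someone@example.com")); // prints false
        System.out.println(filter.accepts("page.html"));                  // prints true
    }
}
```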
-
extractTextLinks
public void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) throws IOException
- Specified by:
- extractTextLinks in class AbstractTextLinkExtractor
- Throws:
- IOException
-
loadTextLinkExtractorFromXML
protected void loadTextLinkExtractorFromXML(XML xml)
Description copied from class: AbstractTextLinkExtractor
Loads configuration settings specific to the implementing class.
- Specified by:
- loadTextLinkExtractorFromXML in class AbstractTextLinkExtractor
- Parameters:
- xml - XML configuration
-
saveTextLinkExtractorToXML
protected void saveTextLinkExtractorToXML(XML xml)
Description copied from class: AbstractTextLinkExtractor
Saves configuration settings specific to the implementing class.
- Specified by:
- saveTextLinkExtractorToXML in class AbstractTextLinkExtractor
- Parameters:
- xml - the XML
-
equals
public boolean equals(Object other)
- Overrides:
- equals in class AbstractTextLinkExtractor
-
hashCode
public int hashCode()
- Overrides:
- hashCode in class AbstractTextLinkExtractor
-
toString
public String toString()
- Overrides:
- toString in class AbstractTextLinkExtractor