public class DOMLinkExtractor extends AbstractTextLinkExtractor
Extracts links from a Document Object Model (DOM) representation of HTML, XHTML, or XML document content, based on the values of matching elements and attributes.
To construct a DOM tree, the text is loaded entirely into memory. The document content is used by default, but the text can also come from specified metadata fields. Use this link extractor with caution if you expect to parse very large files. Use the HtmlLinkExtractor instead if memory usage is a concern.
The jsoup parser library is used to load document content into a DOM tree. Elements are referenced using a CSS or jQuery-like syntax.
This link extractor is normally used before importing. When used before importing, this class attempts to detect the content character encoding unless the character encoding was specified using setCharset(String). Since document parsing converts content to UTF-8, UTF-8 is always assumed when used as a post-parse handler.
You can specify which parser to use when reading documents. The default is "html", which normalizes the content as HTML. This is generally the desired behavior, but it can sometimes cause a selector to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt to normalize the content. The drawback of "xml" is that some HTML-specific selector options may not work. If you know you are dealing with XML to begin with, specifying "xml" should be a good option.
You can define as many JSoup "selectors" as desired. Every value matched by a selector is extracted as a URL.
You can control exactly what gets extracted from each match with the "extract" argument expected with every selector. Possible values are: text, html, outerHtml, ownText, data, tagName, val, className, cssSelector, and attr(attributeKey).
When not specified, the default is "text".
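To make the shape of the "extract" argument concrete, here is a minimal sketch of how such a value could be interpreted. The class ExtractSpec and its methods are hypothetical helpers written for illustration; they are not part of the Norconex API, and the real extractor may parse the argument differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, for illustration only; not part of the Norconex API. */
final class ExtractSpec {

    // Matches the attr(attributeKey) form used in the XML configuration.
    private static final Pattern ATTR = Pattern.compile("attr\\((.+)\\)");

    private ExtractSpec() {
    }

    /** Returns the attribute key of an "attr(key)" value, or null for any other value. */
    static String attributeKey(String extract) {
        if (extract == null) {
            return null;
        }
        Matcher m = ATTR.matcher(extract.trim());
        return m.matches() ? m.group(1).trim() : null;
    }

    /** Falls back to "text" when no extract value is specified. */
    static String orDefault(String extract) {
        if (extract == null || extract.trim().isEmpty()) {
            return "text";
        }
        return extract.trim();
    }
}
```

With this sketch, ExtractSpec.attributeKey("attr(href)") yields the key href, while a blank value resolves to the documented default of "text".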
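The default selectors / extract strategies are: a[href] / attr(href), [src] / attr(src), link[href] / attr(href), and meta[http-equiv='refresh'] / attr(content).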
For any extracted link value, this extractor applies minimal heuristics to clean extra content that is not part of a regular URL. For instance, it only keeps what comes after url= when dealing with <meta http-equiv="refresh"> URLs. It also trims white space.
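The clean-up described above can be sketched as follows. This is an illustration under assumptions, not the extractor's actual code: MetaRefreshCleaner and cleanMetaRefresh are hypothetical names, the "url=" match is assumed to be case-insensitive, and the method is meant to be applied only to the content value of a meta refresh tag.

```java
/** Hypothetical helper, for illustration only; not part of the Norconex API. */
final class MetaRefreshCleaner {

    private static final String MARKER = "url=";

    private MetaRefreshCleaner() {
    }

    /**
     * Keeps only what follows the first "url=" (assumed case-insensitive)
     * and trims surrounding white space. Values without "url=" are only trimmed.
     */
    static String cleanMetaRefresh(String rawValue) {
        if (rawValue == null) {
            return "";
        }
        String value = rawValue.trim();
        for (int i = 0; i + MARKER.length() <= value.length(); i++) {
            if (value.regionMatches(true, i, MARKER, 0, MARKER.length())) {
                return value.substring(i + MARKER.length()).trim();
            }
        }
        return value;
    }

    public static void main(String[] args) {
        // Prints: https://example.com/next.html
        System.out.println(cleanMetaRefresh("5; URL=https://example.com/next.html "));
        // Prints: page.html
        System.out.println(cleanMetaRefresh("  page.html  "));
    }
}
```

A real implementation would likely need more care, for example with quoted URLs inside the content attribute.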
By default, contextual information is kept about the HTML/XML mark-up tag from which a link is extracted (e.g., tag name and attributes). That information is stored as metadata in the target document. If you want to limit the quantity of information extracted and stored, you can disable this feature by setting ignoreLinkData to true.
Only valid schemes are extracted for absolute URLs. By default, those are http, https, and ftp. You can specify your own list of supported protocols with setSchemes(String[]).
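As a rough illustration of scheme filtering, the sketch below checks an absolute URL against a list of accepted schemes using only the JDK. SchemeFilter and isAccepted are hypothetical names. Letting relative URLs through and rejecting unparsable values are assumptions of this sketch, not documented behavior of the extractor.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, for illustration only; not part of the Norconex API. */
final class SchemeFilter {

    // The default schemes named in the text above.
    static final List<String> DEFAULT_SCHEMES = Arrays.asList("http", "https", "ftp");

    private SchemeFilter() {
    }

    /** Returns true if the URL is relative or uses one of the given schemes. */
    static boolean isAccepted(String url, List<String> schemes) {
        try {
            String scheme = new URI(url.trim()).getScheme();
            if (scheme == null) {
                // Assumption: relative URLs are not subject to scheme checks.
                return true;
            }
            return schemes.contains(scheme.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Assumption: values that do not parse as a URI are rejected.
            return false;
        }
    }
}
```

For example, SchemeFilter.isAccepted("mailto:someone@example.com", SchemeFilter.DEFAULT_SCHEMES) returns false, while the same call with "https://example.com/" returns true.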
By default, this extractor is applied only to documents matching one of a default set of content types.
By default, a regular HTML link with the "rel" attribute set to "nofollow" is not extracted (e.g. <a href="x.html" rel="nofollow" ...>). To force its extraction (and ensure it is followed), call setIgnoreNofollow(boolean) with true.
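The nofollow rule can be pictured with the small sketch below. NofollowCheck and its methods are hypothetical and only model the decision described above. Treating "rel" as a white-space-separated, case-insensitive token list is an assumption; the extractor may match the attribute value differently.

```java
/** Hypothetical helper, for illustration only; not part of the Norconex API. */
final class NofollowCheck {

    private NofollowCheck() {
    }

    /** Returns true if the rel attribute value contains a "nofollow" token. */
    static boolean isNofollow(String rel) {
        if (rel == null) {
            return false;
        }
        for (String token : rel.trim().split("\\s+")) {
            if (token.equalsIgnoreCase("nofollow")) {
                return true;
            }
        }
        return false;
    }

    /** A link is extracted unless it is nofollow and nofollow is being honored. */
    static boolean shouldExtract(String rel, boolean ignoreNofollow) {
        return ignoreNofollow || !isNofollow(rel);
    }
}
```

Here shouldExtract("nofollow", false) models the default behavior (the link is skipped), and shouldExtract("nofollow", true) models the effect of setIgnoreNofollow(true).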
XML configuration usage:
<extractor
    class="com.norconex.collector.http.link.impl.DOMLinkExtractor"
    ignoreNofollow="[false|true]"
    ignoreLinkData="[false|true]"
    parser="[html|xml]"
    charset="(supported character encoding)">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  <fieldMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (optional expression for fields used for links extraction instead
    of the document stream)
  </fieldMatcher>
  <schemes>
    (CSV list of URI schemes for which to perform link extraction.
    Leave blank or remove tag to use defaults.)
  </schemes>
  <!-- Repeat as needed: -->
  <linkSelector
      extract="[text|html|outerHtml|ownText|data|tagName|val|className|cssSelector|attr(attributeKey)]">
    (selector syntax)
  </linkSelector>
  <!--
    Optional. Only apply link selectors to portions of a document
    matching these selectors. Repeat as needed.
    -->
  <extractSelector>(selector syntax)</extractSelector>
  <!--
    Optional. Do not apply link selectors to portions of a document
    matching these selectors. Repeat as needed.
    -->
  <noExtractSelector>(selector syntax)</noExtractSelector>
</extractor>
XML usage example:
<extractor
    class="com.norconex.collector.http.link.impl.DOMLinkExtractor">
  <linkSelector extract="attr(href)">a[href]</linkSelector>
  <linkSelector extract="attr(src)">[src]</linkSelector>
  <linkSelector extract="attr(href)">link[href]</linkSelector>
  <linkSelector extract="attr(content)">meta[http-equiv='refresh']</linkSelector>
  <linkSelector extract="attr(data-myurl)">[data-myurl]</linkSelector>
</extractor>
The above example will extract URLs found in custom element attributes named data-myurl.
Constructor and Description |
---|
DOMLinkExtractor() |
Modifier and Type | Method and Description |
---|---|
void | addExtractSelectors(List<String> selectors) |
void | addExtractSelectors(String... selectors) |
void | addLinkSelector(String selector): Adds a new link selector extracting the "text" from matches. |
void | addLinkSelector(String selector, String extract) |
void | addNoExtractSelectors(List<String> selectors) |
void | addNoExtractSelectors(String... selectors) |
void | clearLinkSelectors() |
boolean | equals(Object other) |
void | extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) |
String | getCharset(): Gets the assumed source character encoding. |
List<String> | getExtractSelectors() |
List<String> | getNoExtractSelectors() |
String | getParser(): Gets the parser to use when creating the DOM-tree. |
List<String> | getSchemes(): Gets the schemes to be extracted. |
int | hashCode() |
boolean | isIgnoreLinkData(): Gets whether to ignore extra data associated with a link. |
boolean | isIgnoreNofollow() |
protected void | loadTextLinkExtractorFromXML(XML xml): Loads configuration settings specific to the implementing class. |
void | removeLinkSelector(String selector) |
protected void | saveTextLinkExtractorToXML(XML xml): Saves configuration settings specific to the implementing class. |
void | setCharset(String charset): Sets the assumed source character encoding. |
void | setExtractSelectors(List<String> selectors) |
void | setExtractSelectors(String... selectors) |
void | setIgnoreLinkData(boolean ignoreLinkData): Sets whether to ignore extra data associated with a link. |
void | setIgnoreNofollow(boolean ignoreNofollow) |
void | setNoExtractSelectors(List<String> selectors) |
void | setNoExtractSelectors(String... selectors) |
void | setParser(String parser): Sets the parser to use when creating the DOM-tree. |
void | setSchemes(List<String> schemes): Sets the schemes to be extracted. |
void | setSchemes(String... schemes): Sets the schemes to be extracted. |
String | toString() |
Methods inherited from class AbstractTextLinkExtractor: extractLinks, getFieldMatcher, loadLinkExtractorFromXML, saveLinkExtractorToXML, setFieldMatcher
Methods inherited from further up the class hierarchy: addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
public String getCharset()

public void setCharset(String charset)
Parameters: charset - character encoding of the source to be transformed

public String getParser()
Returns: html (default) or xml.

public void setParser(String parser)
Parameters: parser - html or xml.

public boolean isIgnoreNofollow()

public void setIgnoreNofollow(boolean ignoreNofollow)

public boolean isIgnoreLinkData()
Returns: true to ignore.

public void setIgnoreLinkData(boolean ignoreLinkData)
Parameters: ignoreLinkData - true to ignore.

public void addLinkSelector(String selector)
Parameters: selector - JSoup selector

public void removeLinkSelector(String selector)

public void clearLinkSelectors()

public void setExtractSelectors(String... selectors)

public void addExtractSelectors(String... selectors)

public void setNoExtractSelectors(String... selectors)

public void addNoExtractSelectors(String... selectors)

public List<String> getSchemes()

public void setSchemes(String... schemes)
Parameters: schemes - schemes to be extracted

public void setSchemes(List<String> schemes)
Parameters: schemes - schemes to be extracted

public void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) throws IOException
Specified by: extractTextLinks in class AbstractTextLinkExtractor
Throws: IOException

protected void loadTextLinkExtractorFromXML(XML xml)
Description copied from class: AbstractTextLinkExtractor
Specified by: loadTextLinkExtractorFromXML in class AbstractTextLinkExtractor
Parameters: xml - XML configuration

protected void saveTextLinkExtractorToXML(XML xml)
Description copied from class: AbstractTextLinkExtractor
Specified by: saveTextLinkExtractorToXML in class AbstractTextLinkExtractor
Parameters: xml - the XML

public boolean equals(Object other)
Overrides: equals in class AbstractTextLinkExtractor

public int hashCode()
Overrides: hashCode in class AbstractTextLinkExtractor

public String toString()
Overrides: toString in class AbstractTextLinkExtractor

Copyright © 2009–2023 Norconex Inc. All rights reserved.