DOMLinkExtractor (Norconex HTTP Collector 3.0.2 API)

java.lang.Object
- com.norconex.collector.http.link.AbstractLinkExtractor
  - com.norconex.collector.http.link.AbstractTextLinkExtractor
    - com.norconex.collector.http.link.impl.DOMLinkExtractor

All Implemented Interfaces:

ILinkExtractor, IXMLConfigurable
```
public class DOMLinkExtractor
extends AbstractTextLinkExtractor
```
Extracts links from a Document Object Model (DOM) representation of HTML, XHTML, or XML document content, based on the values of matching elements and attributes.

To construct a DOM tree, the text is loaded entirely into memory. By default the text is the document content, but it can also come from specified metadata fields. Use this link extractor with caution if you expect to parse very large files; consider the HtmlLinkExtractor instead if memory is a concern.

The jsoup parser library is used to load document content into a DOM tree. Elements are referenced using a CSS or jQuery-like syntax.

This link extractor is normally used before importing.

When used before importing, this class attempts to detect the content character encoding unless one was specified using setCharset(String). Since document parsing converts content to UTF-8, UTF-8 is always assumed when used as a post-parse handler.

You can specify which parser to use when reading documents. The default is "html", which normalizes the content as HTML. This is generally the desired behavior, but it can sometimes cause your selectors to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt to normalize the content. The drawback of "xml" is that some HTML-specific selector options may not work. If you know you are dealing with XML to begin with, "xml" should be a good option.

Matching links

You can define as many jsoup "selectors" as desired. All values matched by a selector will be extracted as a URL.

The "extract" argument expected with every selector controls exactly what gets extracted from each matching element. Possible values are:
- text: Default option when extract is blank. The text of the element, including combined children.
- html: Extracts an element's inner HTML (including children).
- outerHtml: Extracts an element's outer HTML (like "html", but includes the "current" tag).
- ownText: Extracts the text owned by this element only; does not get the combined text of all children.
- data: Extracts the combined data of a data-element (e.g. <script>).
- id: Extracts the ID attribute of the element (if any).
- tagName: Extracts the name of the tag of the element.
- val: Extracts the value of a form element (input, textarea, etc).
- className: Extracts the literal value of the element's "class" attribute, which may include multiple class names, space separated.
- cssSelector: Extracts a CSS selector that will uniquely select (identify) this element.
- attr(attributeKey): Extracts the value of the element attribute matching your replacement for "attributeKey" (e.g. "attr(title)" will extract the "title" attribute).
When not specified, the default is "text".
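
As an illustration of how an "extract" value could be interpreted, the sketch below splits a value such as `attr(title)` into an option name and an attribute key, and falls back to `text` when the value is blank. The class and method names (`ExtractArg`, `parse`) are hypothetical and only the JDK is used; this is not the library's own parsing code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits an "extract" argument into its parts. */
final class ExtractArg {
    // Matches "attr(someKey)" and captures "someKey".
    private static final Pattern ATTR = Pattern.compile("attr\\((.+)\\)");

    final String option;        // e.g. "text", "html", "attr"
    final String attributeKey;  // non-null only when option is "attr"

    private ExtractArg(String option, String attributeKey) {
        this.option = option;
        this.attributeKey = attributeKey;
    }

    static ExtractArg parse(String extract) {
        // Blank or missing: fall back to the documented default, "text".
        if (extract == null || extract.trim().isEmpty()) {
            return new ExtractArg("text", null);
        }
        String value = extract.trim();
        Matcher m = ATTR.matcher(value);
        if (m.matches()) {
            return new ExtractArg("attr", m.group(1).trim());
        }
        // Any other value is kept as the option name, unvalidated.
        return new ExtractArg(value, null);
    }
}
```

With this sketch, `ExtractArg.parse("attr(data-myurl)")` yields the option `attr` with the attribute key `data-myurl`.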

The default selectors / extract strategies are:
- a[href] / attr(href)
- [src] / attr(src)
- link[href] / attr(href)
- meta[http-equiv='refresh'] / attr(content)
For any extracted link value, this extractor applies minimal heuristics to remove extra content that is not part of a regular URL. For instance, it keeps only what comes after url= when dealing with <meta http-equiv="refresh"> URLs. It also trims whitespace.
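
The clean-up described above can be pictured with a small JDK-only sketch. The `RefreshUrlCleaner` class is hypothetical, and the quote-stripping step is an added assumption; the library's actual heuristics may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper illustrating the clean-up of a meta refresh value. */
final class RefreshUrlCleaner {
    // Case-insensitive "url=" marker, tolerating spaces before "=".
    private static final Pattern URL_MARKER =
            Pattern.compile("url\\s*=", Pattern.CASE_INSENSITIVE);

    static String clean(String content) {
        if (content == null) {
            return "";
        }
        String value = content.trim();
        Matcher m = URL_MARKER.matcher(value);
        if (m.find()) {
            // Keep only what follows the first "url=".
            value = value.substring(m.end()).trim();
        }
        // Assumption: drop one pair of surrounding quotes, if present.
        if (value.length() >= 2) {
            char first = value.charAt(0);
            char last = value.charAt(value.length() - 1);
            if ((first == '\'' || first == '"') && first == last) {
                value = value.substring(1, value.length() - 1).trim();
            }
        }
        return value;
    }
}
```

This sketch is meant only for values taken from a meta refresh `content` attribute; applying it to ordinary links would wrongly truncate URLs whose query string contains `url=`.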

Ignoring link data

By default, contextual information is kept about the HTML/XML mark-up tag from which a link is extracted (e.g., tag name and attributes). That information gets stored as metadata in the target document. If you want to limit the quantity of information extracted/stored, you can disable this feature by setting ignoreLinkData to true.

URL Schemes

For absolute URLs, only those with a supported scheme are extracted. By default, the supported schemes are http, https, and ftp. You can specify your own list of supported schemes with setSchemes(String[]).
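
A minimal sketch of such a scheme check follows, using only the JDK. The `SchemeFilter` class is hypothetical, and letting scheme-less (relative) URLs through is an assumption drawn from the statement that the rule applies to absolute URLs.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: accepts absolute URLs with a supported scheme. */
final class SchemeFilter {
    // RFC 3986 scheme: a letter followed by letters, digits, "+", "-" or ".".
    private static final Pattern SCHEME =
            Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*):");

    private final Set<String> schemes = new HashSet<>();

    SchemeFilter(String... supported) {
        for (String s : supported) {
            schemes.add(s.toLowerCase(Locale.ROOT));
        }
    }

    boolean accepts(String url) {
        Matcher m = SCHEME.matcher(url.trim());
        if (!m.find()) {
            // Assumption: no scheme means a relative URL, which is left alone.
            return true;
        }
        // Scheme comparison is case-insensitive.
        return schemes.contains(m.group(1).toLowerCase(Locale.ROOT));
    }
}
```

For example, `new SchemeFilter("http", "https", "ftp")` accepts `https://example.com/` and `/relative/page.html` but rejects `mailto:someone@example.com`.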

Applicable documents

By default, this extractor is applied only to documents matching one of these content types:
- application/atom+xml
- application/mathml+xml
- application/rss+xml
- application/vnd.wap.xhtml+xml
- application/x-asp
- application/xhtml+xml
- application/xml
- application/xslt+xml
- image/svg+xml
- text/html
- text/xml
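
The default applicability check can be sketched as a simple set lookup. The `ApplicableContentTypes` class below is hypothetical and JDK-only; stripping parameters such as `; charset=UTF-8` before comparing is an assumption, not documented behavior.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: tests a content type against the default list. */
final class ApplicableContentTypes {
    private static final Set<String> DEFAULTS = new HashSet<>(Arrays.asList(
            "application/atom+xml",
            "application/mathml+xml",
            "application/rss+xml",
            "application/vnd.wap.xhtml+xml",
            "application/x-asp",
            "application/xhtml+xml",
            "application/xml",
            "application/xslt+xml",
            "image/svg+xml",
            "text/html",
            "text/xml"));

    static boolean isApplicable(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Assumption: ignore parameters such as "; charset=UTF-8".
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return DEFAULTS.contains(base.trim().toLowerCase(Locale.ROOT));
    }
}
```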
"nofollow"

By default, a regular HTML link having the "rel" attribute set to "nofollow" won't be extracted (e.g. <a href="x.html" rel="nofollow" ...>). To force its extraction (and ensure it is followed), call setIgnoreNofollow(boolean) with true.
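
The decision described above can be sketched as follows. The `NofollowRule` class is hypothetical and JDK-only; treating "rel" as a list of space-separated, case-insensitive tokens is an assumption based on how the attribute is commonly written.

```java
/** Hypothetical helper: decides whether a link should be extracted. */
final class NofollowRule {
    static boolean shouldExtract(String rel, boolean ignoreNofollow) {
        // With ignoreNofollow set, "nofollow" has no effect.
        if (ignoreNofollow || rel == null) {
            return true;
        }
        // "rel" may hold several space-separated tokens.
        for (String token : rel.trim().split("\\s+")) {
            if (token.equalsIgnoreCase("nofollow")) {
                return false;
            }
        }
        return true;
    }
}
```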

XML configuration usage:
```
<extractor
    class="com.norconex.collector.http.link.impl.DOMLinkExtractor"
    ignoreNofollow="[false|true]"
    ignoreLinkData="[false|true]"
    parser="[html|xml]"
    charset="(supported character encoding)">
  
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  <fieldMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (optional expression for fields used for links extraction instead
     of the document stream)
  </fieldMatcher>
  <schemes>
    (CSV list of URI schemes for which to perform link extraction.
     Leave blank or remove tag to use defaults.)
  </schemes>
  
  <linkSelector
      extract="[text|html|outerHtml|ownText|data|id|tagName|val|className|cssSelector|attr(attributeKey)]">
    (selector syntax)
  </linkSelector>
  
  <extractSelector>(selector syntax)</extractSelector>
  
  <noExtractSelector>(selector syntax)</noExtractSelector>
</extractor>
```
XML usage example:
```
<extractor
    class="com.norconex.collector.http.link.impl.DOMLinkExtractor">
  <linkSelector
      extract="attr(href)">
    a[href]
  </linkSelector>
  <linkSelector
      extract="attr(src)">
    [src]
  </linkSelector>
  <linkSelector
      extract="attr(href)">
    link[href]
  </linkSelector>
  <linkSelector
      extract="attr(content)">
    meta[http-equiv='refresh']
  </linkSelector>
  <linkSelector
      extract="attr(data-myurl)">
    [data-myurl]
  </linkSelector>
</extractor>
```
In addition to the default selectors, the above example extracts URLs found in custom element attributes named data-myurl.
Since:

3.0.0

Author:

Pascal Essiembre

Constructor Summary

Constructors
Constructor and Description

DOMLinkExtractor()

Method Summary

Modifier and Type	Method and Description
`void`	`addExtractSelectors(List<String> selectors)`
`void`	`addExtractSelectors(String... selectors)`
`void`	`addLinkSelector(String selector)` Adds a new link selector extracting the "text" from matches.
`void`	`addLinkSelector(String selector, String extract)`
`void`	`addNoExtractSelectors(List<String> selectors)`
`void`	`addNoExtractSelectors(String... selectors)`
`void`	`clearLinkSelectors()`
`boolean`	`equals(Object other)`
`void`	`extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader)`
`String`	`getCharset()` Gets the assumed source character encoding.
`List<String>`	`getExtractSelectors()`
`List<String>`	`getNoExtractSelectors()`
`String`	`getParser()` Gets the parser to use when creating the DOM-tree.
`List<String>`	`getSchemes()` Gets the schemes to be extracted.
`int`	`hashCode()`
`boolean`	`isIgnoreLinkData()` Gets whether to ignore extra data associated with a link.
`boolean`	`isIgnoreNofollow()`
`protected void`	`loadTextLinkExtractorFromXML(XML xml)` Loads configuration settings specific to the implementing class.
`void`	`removeLinkSelector(String selector)`
`protected void`	`saveTextLinkExtractorToXML(XML xml)` Saves configuration settings specific to the implementing class.
`void`	`setCharset(String charset)` Sets the assumed source character encoding.
`void`	`setExtractSelectors(List<String> selectors)`
`void`	`setExtractSelectors(String... selectors)`
`void`	`setIgnoreLinkData(boolean ignoreLinkData)` Sets whether to ignore extra data associated with a link.
`void`	`setIgnoreNofollow(boolean ignoreNofollow)`
`void`	`setNoExtractSelectors(List<String> selectors)`
`void`	`setNoExtractSelectors(String... selectors)`
`void`	`setParser(String parser)` Sets the parser to use when creating the DOM-tree.
`void`	`setSchemes(List<String> schemes)` Sets the schemes to be extracted.
`void`	`setSchemes(String... schemes)` Sets the schemes to be extracted.
`String`	`toString()`

Methods inherited from class com.norconex.collector.http.link.AbstractTextLinkExtractor
extractLinks, getFieldMatcher, loadLinkExtractorFromXML, saveLinkExtractorToXML, setFieldMatcher

Methods inherited from class com.norconex.collector.http.link.AbstractLinkExtractor
addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - DOMLinkExtractor
```
public DOMLinkExtractor()
```
- Method Detail
  - getCharset
```
public String getCharset()
```
    Gets the assumed source character encoding.
    
    Returns:
    
    character encoding of the source to be transformed
  - setCharset
```
public void setCharset(String charset)
```
    Sets the assumed source character encoding.
    
    Parameters:
    
    charset - character encoding of the source to be transformed
  - getParser
```
public String getParser()
```
    Gets the parser to use when creating the DOM-tree.
    
    Returns:
    
    html (default) or xml.
  - setParser
```
public void setParser(String parser)
```
    Sets the parser to use when creating the DOM-tree.
    
    Parameters:
    
    parser - html or xml.
  - isIgnoreNofollow
```
public boolean isIgnoreNofollow()
```
  - setIgnoreNofollow
```
public void setIgnoreNofollow(boolean ignoreNofollow)
```
  - isIgnoreLinkData
```
public boolean isIgnoreLinkData()
```
    Gets whether to ignore extra data associated with a link.
    
    Returns:
    
    true to ignore.
  - setIgnoreLinkData
```
public void setIgnoreLinkData(boolean ignoreLinkData)
```
    Sets whether to ignore extra data associated with a link.
    
    Parameters:
    
    ignoreLinkData - true to ignore.
  - addLinkSelector
```
public void addLinkSelector(String selector)
```
    Adds a new link selector extracting the "text" from matches.
    
    Parameters:
    
    selector - JSoup selector
  - addLinkSelector
```
public void addLinkSelector(String selector,
                            String extract)
```
  - removeLinkSelector
```
public void removeLinkSelector(String selector)
```
  - clearLinkSelectors
```
public void clearLinkSelectors()
```
  - getExtractSelectors
```
public List<String> getExtractSelectors()
```
  - setExtractSelectors
```
public void setExtractSelectors(List<String> selectors)
```
  - setExtractSelectors
```
public void setExtractSelectors(String... selectors)
```
  - addExtractSelectors
```
public void addExtractSelectors(List<String> selectors)
```
  - addExtractSelectors
```
public void addExtractSelectors(String... selectors)
```
  - getNoExtractSelectors
```
public List<String> getNoExtractSelectors()
```
  - setNoExtractSelectors
```
public void setNoExtractSelectors(List<String> selectors)
```
  - setNoExtractSelectors
```
public void setNoExtractSelectors(String... selectors)
```
  - addNoExtractSelectors
```
public void addNoExtractSelectors(List<String> selectors)
```
  - addNoExtractSelectors
```
public void addNoExtractSelectors(String... selectors)
```
  - getSchemes
```
public List<String> getSchemes()
```
    Gets the schemes to be extracted.
    
    Returns:
    
    schemes to be extracted
  - setSchemes
```
public void setSchemes(String... schemes)
```
    Sets the schemes to be extracted.
    
    Parameters:
    
    schemes - schemes to be extracted
  - setSchemes
```
public void setSchemes(List<String> schemes)
```
    Sets the schemes to be extracted.
    
    Parameters:
    
    schemes - schemes to be extracted
  - extractTextLinks
```
public void extractTextLinks(Set<Link> links,
                             HandlerDoc doc,
                             Reader reader)
                      throws IOException
```
    Specified by:
    
    extractTextLinks in class AbstractTextLinkExtractor
    
    Throws:
    
    IOException
  - loadTextLinkExtractorFromXML
```
protected void loadTextLinkExtractorFromXML(XML xml)
```
    Description copied from class: AbstractTextLinkExtractor
    
    Loads configuration settings specific to the implementing class.
    
    Specified by:
    
    loadTextLinkExtractorFromXML in class AbstractTextLinkExtractor
    
    Parameters:
    
    xml - XML configuration
  - saveTextLinkExtractorToXML
```
protected void saveTextLinkExtractorToXML(XML xml)
```
    Description copied from class: AbstractTextLinkExtractor
    
    Saves configuration settings specific to the implementing class.
    
    Specified by:
    
    saveTextLinkExtractorToXML in class AbstractTextLinkExtractor
    
    Parameters:
    
    xml - the XML
  - equals
```
public boolean equals(Object other)
```
    Overrides:
    
    equals in class AbstractTextLinkExtractor
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class AbstractTextLinkExtractor
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class AbstractTextLinkExtractor
