Class DOMLinkExtractor

All Implemented Interfaces:
ILinkExtractor, IXMLConfigurable

public class DOMLinkExtractor extends AbstractTextLinkExtractor

Extracts links from a Document Object Model (DOM) representation of HTML, XHTML, or XML document content, based on the values of matching elements and attributes.

To construct the DOM tree, the text is loaded entirely into memory. The document content is used by default, but the text can also come from specified metadata fields. Use this extractor with caution if you expect to parse very large files, and consider the HtmlLinkExtractor instead if memory is a concern.

The jsoup parser library is used to load document content into a DOM tree. Elements are referenced using a CSS or jQuery-like selector syntax.

This link extractor is normally used before importing.

When used before importing, this class attempts to detect the content character encoding unless one was specified using setCharset(String). Since document parsing converts content to UTF-8, UTF-8 is always assumed when this class is used as a post-parse handler.

You can specify which parser to use when reading documents. The default is "html", which normalizes the content as HTML. This is generally the desired behavior, but normalization can sometimes cause a selector to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt to normalize the content. The drawback of "xml" is that some HTML-specific selector options may not work. If you know you are dealing with XML to begin with, "xml" should be a good option.

Matching links

You can define as many jsoup "selectors" as desired. Every value matched by a selector is extracted as a URL.

You can control exactly what gets extracted from each match with the "extract" argument accepted by every selector. Values used in this documentation include "text" and "attr(attributeName)".

When not specified, the default is "text".

The default selectors / extract strategies are:

  • a[href] / attr(href)
  • [src] / attr(src)
  • link[href] / attr(href)
  • meta[http-equiv='refresh'] / attr(content)
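
To make the default selector / extract pairs concrete, here is a minimal sketch that pulls the same attribute values out of a DOM. It is not this class's implementation: it swaps jsoup and CSS selectors for the JDK's built-in XML parser and XPath so that it runs without extra dependencies, which means it only accepts well-formed XHTML/XML. The class name DefaultSelectorSketch and its extract method are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of the Norconex API.
final class DefaultSelectorSketch {

    // XPath stand-ins (an assumption) for the default pairs:
    // a[href]/attr(href), [src]/attr(src), link[href]/attr(href),
    // meta[http-equiv='refresh']/attr(content).
    private static final String[] ATTRIBUTE_PATHS = {
        "//a/@href",
        "//*/@src",
        "//link/@href",
        "//meta[@http-equiv='refresh']/@content"
    };

    // Loads the whole text into a DOM tree, then collects raw attribute values.
    static List<String> extract(String xhtml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            List<String> values = new ArrayList<>();
            for (String path : ATTRIBUTE_PATHS) {
                NodeList nodes = (NodeList) xpath.evaluate(
                        path, dom, XPathConstants.NODESET);
                for (int i = 0; i < nodes.getLength(); i++) {
                    values.add(nodes.item(i).getNodeValue());
                }
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse content", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head>"
                + "<link href=\"style.css\"/>"
                + "<meta http-equiv=\"refresh\" content=\"5; url=next.html\"/>"
                + "</head><body>"
                + "<a href=\"page.html\">Page</a><img src=\"logo.png\"/>"
                + "</body></html>";
        // prints [page.html, logo.png, style.css, 5; url=next.html]
        System.out.println(extract(xhtml));
    }
}
```

The values come back grouped by selector rather than in document order, and the meta refresh value is still raw at this point.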

For any extracted link value, this extractor applies minimal heuristics to remove extra content that is not part of a regular URL. For instance, it only keeps what comes after url= when dealing with <meta http-equiv="refresh"> URLs. It also trims whitespace.
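
The cleaning described above can be sketched as follows. This is an illustration of the idea under stated assumptions, not the extractor's actual code: the class LinkValueCleanerSketch and its method are hypothetical, and matching url= case-insensitively is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the Norconex API.
final class LinkValueCleanerSketch {

    // Assumption: the "url=" marker is matched regardless of case.
    private static final Pattern URL_MARKER =
            Pattern.compile("url=", Pattern.CASE_INSENSITIVE);

    // Keeps what follows the first "url=" in a meta refresh content value
    // (or the whole value when there is no marker), then trims whitespace.
    static String cleanMetaRefresh(String content) {
        Matcher m = URL_MARKER.matcher(content);
        String value = m.find() ? content.substring(m.end()) : content;
        return value.trim();
    }

    public static void main(String[] args) {
        // prints http://example.com/next.html
        System.out.println(
                cleanMetaRefresh("5; url=http://example.com/next.html "));
    }
}
```

A real implementation would apply the url= rule only to values coming from meta refresh tags, since an ordinary URL can legitimately contain url= in its query string.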

Ignoring link data

By default, contextual information is kept about the HTML/XML mark-up tag from which a link is extracted (e.g., tag name and attributes). That information gets stored as metadata in the target document. If you want to limit the quantity of information extracted/stored, you can disable this feature by setting ignoreLinkData to true.

URL Schemes

Only valid schemes are extracted for absolute URLs. By default, those are http, https, and ftp. You can specify your own list of supported protocols with setSchemes(String[]).
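
As a rough illustration of scheme filtering, the sketch below accepts an absolute URL only when its scheme is in the allowed set. The class SchemeFilterSketch is hypothetical and does not reflect this extractor's internals. Two behaviors are assumptions on my part: relative URLs are let through because they carry no scheme to check, and values that java.net.URI cannot parse are rejected.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the Norconex API.
final class SchemeFilterSketch {

    // The defaults named in the documentation above.
    static final Set<String> DEFAULT_SCHEMES =
            new HashSet<>(Arrays.asList("http", "https", "ftp"));

    static boolean isAccepted(String url, Set<String> schemes) {
        try {
            URI uri = new URI(url.trim());
            if (!uri.isAbsolute()) {
                // Assumption: relative URLs have no scheme, so nothing to filter.
                return true;
            }
            return schemes.contains(uri.getScheme().toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Assumption: unparsable values are rejected.
            return false;
        }
    }

    public static void main(String[] args) {
        // prints true
        System.out.println(isAccepted("https://example.com/a", DEFAULT_SCHEMES));
        // prints false
        System.out.println(
                isAccepted("mailto:someone@example.com", DEFAULT_SCHEMES));
    }
}
```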

Applicable documents

By default, this extractor is only applied to documents matching a default set of content types.

"nofollow"

By default, a regular HTML link with its "rel" attribute set to "nofollow" is not extracted (e.g., <a href="x.html" rel="nofollow" ...>). To force its extraction (and ensure it is followed), call setIgnoreNofollow(boolean) with true.
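
The decision described above can be sketched as a small predicate. This is an illustration only: the class NofollowSketch is hypothetical, and treating "rel" as a whitespace-separated, case-insensitive token list is an assumption about how such a check might be written, not a statement about this extractor's code.

```java
// Hypothetical helper for illustration only; not part of the Norconex API.
final class NofollowSketch {

    // True when the rel attribute value contains a "nofollow" token.
    static boolean hasNofollow(String rel) {
        if (rel == null) {
            return false;
        }
        for (String token : rel.trim().split("\\s+")) {
            if ("nofollow".equalsIgnoreCase(token)) {
                return true;
            }
        }
        return false;
    }

    // A link is extracted unless it is marked nofollow and that mark is honored.
    static boolean shouldExtract(String rel, boolean ignoreNofollow) {
        return ignoreNofollow || !hasNofollow(rel);
    }

    public static void main(String[] args) {
        // prints false
        System.out.println(shouldExtract("nofollow", false));
        // prints true
        System.out.println(shouldExtract("nofollow", true));
    }
}
```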

XML configuration usage:


<extractor
    class="com.norconex.collector.http.link.impl.DOMLinkExtractor"
    ignoreNofollow="[false|true]"
    ignoreLinkData="[false|true]"
    parser="[html|xml]"
    charset="(supported character encoding)">
  <fieldMatcher>
    (optional expression for fields used for links extraction instead
     of the document stream)
  </fieldMatcher>
  <schemes>
    (CSV list of URI schemes for which to perform link extraction.
     Leave blank or remove tag to use defaults.)
  </schemes>
  <!-- Repeat as needed: -->
  <linkSelector>(selector syntax)</linkSelector>
  <!--
    Optional. Only apply link selectors to portions of a document
    matching these selectors. Repeat as needed.
    -->
  <extractSelector>(selector syntax)</extractSelector>
  <!--
    Optional. Do not apply link selectors to portions of a document
    matching these selectors. Repeat as needed.
    -->
  <noExtractSelector>(selector syntax)</noExtractSelector>
</extractor>

XML usage example:


<extractor
    class="com.norconex.collector.http.link.impl.DOMLinkExtractor">
  <linkSelector
      extract="attr(href)">
    a[href]
  </linkSelector>
  <linkSelector
      extract="attr(src)">
    [src]
  </linkSelector>
  <linkSelector
      extract="attr(href)">
    link[href]
  </linkSelector>
  <linkSelector
      extract="attr(content)">
    meta[http-equiv='refresh']
  </linkSelector>
  <linkSelector
      extract="attr(data-myurl)">
    [data-myurl]
  </linkSelector>
</extractor>

The above example extracts URLs found in custom element attributes named data-myurl, in addition to those matched by the default selectors, which it repeats.

Since:
3.0.0
Author:
Pascal Essiembre
  • Constructor Details

    • DOMLinkExtractor

      public DOMLinkExtractor()
  • Method Details

    • getCharset

      public String getCharset()
      Gets the assumed source character encoding.
      Returns:
      character encoding of the source to be transformed
    • setCharset

      public void setCharset(String charset)
      Sets the assumed source character encoding.
      Parameters:
      charset - character encoding of the source to be transformed
    • getParser

      public String getParser()
      Gets the parser to use when creating the DOM-tree.
      Returns:
      html (default) or xml.
    • setParser

      public void setParser(String parser)
      Sets the parser to use when creating the DOM-tree.
      Parameters:
      parser - html or xml.
    • isIgnoreNofollow

      public boolean isIgnoreNofollow()
    • setIgnoreNofollow

      public void setIgnoreNofollow(boolean ignoreNofollow)
    • isIgnoreLinkData

      public boolean isIgnoreLinkData()
      Gets whether to ignore extra data associated with a link.
      Returns:
      true to ignore.
    • setIgnoreLinkData

      public void setIgnoreLinkData(boolean ignoreLinkData)
      Sets whether to ignore extra data associated with a link.
      Parameters:
      ignoreLinkData - true to ignore.
    • addLinkSelector

      public void addLinkSelector(String selector)
      Adds a new link selector extracting the "text" from matches.
      Parameters:
      selector - JSoup selector
    • addLinkSelector

      public void addLinkSelector(String selector, String extract)
    • removeLinkSelector

      public void removeLinkSelector(String selector)
    • clearLinkSelectors

      public void clearLinkSelectors()
    • getExtractSelectors

      public List<String> getExtractSelectors()
    • setExtractSelectors

      public void setExtractSelectors(List<String> selectors)
    • setExtractSelectors

      public void setExtractSelectors(String... selectors)
    • addExtractSelectors

      public void addExtractSelectors(List<String> selectors)
    • addExtractSelectors

      public void addExtractSelectors(String... selectors)
    • getNoExtractSelectors

      public List<String> getNoExtractSelectors()
    • setNoExtractSelectors

      public void setNoExtractSelectors(List<String> selectors)
    • setNoExtractSelectors

      public void setNoExtractSelectors(String... selectors)
    • addNoExtractSelectors

      public void addNoExtractSelectors(List<String> selectors)
    • addNoExtractSelectors

      public void addNoExtractSelectors(String... selectors)
    • getSchemes

      public List<String> getSchemes()
      Gets the schemes to be extracted.
      Returns:
      schemes to be extracted
    • setSchemes

      public void setSchemes(String... schemes)
      Sets the schemes to be extracted.
      Parameters:
      schemes - schemes to be extracted
    • setSchemes

      public void setSchemes(List<String> schemes)
      Sets the schemes to be extracted.
      Parameters:
      schemes - schemes to be extracted
    • extractTextLinks

      public void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) throws IOException
      Specified by:
      extractTextLinks in class AbstractTextLinkExtractor
      Throws:
      IOException
    • loadTextLinkExtractorFromXML

      protected void loadTextLinkExtractorFromXML(XML xml)
      Description copied from class: AbstractTextLinkExtractor
      Loads configuration settings specific to the implementing class.
      Specified by:
      loadTextLinkExtractorFromXML in class AbstractTextLinkExtractor
      Parameters:
      xml - XML configuration
    • saveTextLinkExtractorToXML

      protected void saveTextLinkExtractorToXML(XML xml)
      Description copied from class: AbstractTextLinkExtractor
      Saves configuration settings specific to the implementing class.
      Specified by:
      saveTextLinkExtractorToXML in class AbstractTextLinkExtractor
      Parameters:
      xml - the XML
    • equals

      public boolean equals(Object other)
      Overrides:
      equals in class AbstractTextLinkExtractor
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class AbstractTextLinkExtractor
    • toString

      public String toString()
      Overrides:
      toString in class AbstractTextLinkExtractor