Class DOMLinkExtractor

  • All Implemented Interfaces:
    ILinkExtractor, IXMLConfigurable

    public class DOMLinkExtractor
    extends AbstractTextLinkExtractor

    Extracts links from a Document Object Model (DOM) representation of HTML, XHTML, or XML document content, based on the values of matching elements and attributes.

    To construct the DOM tree, the text is loaded entirely into memory. The text comes from the document content by default, but it can also come from specified metadata fields. Use this extractor with caution if you expect to parse huge files; if memory is a concern, use HtmlLinkExtractor instead.

    The jsoup parser library is used to load document content into a DOM tree. Elements are referenced using a CSS or jQuery-like syntax.

    This link extractor is normally used before importing.

    When used before importing, this class attempts to detect the content character encoding unless one was specified using setCharset(String). Since document parsing converts content to UTF-8, UTF-8 is always assumed when this class is used as a post-parse handler.
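
    The encoding rules above can be summarized as a small decision function. This is only a sketch of the described behavior, not this class's actual code: the CharsetChoice class, the resolve method, and the "detected" parameter (standing in for whatever encoding detection yields) are hypothetical names introduced for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetChoice {

    // Hypothetical helper mirroring the rules described above.
    // configured: value given to setCharset(String), or null if unset.
    // detected:   result of encoding detection (assumed to happen elsewhere).
    // postParse:  true when the extractor runs after document parsing.
    public static Charset resolve(
            String configured, Charset detected, boolean postParse) {
        if (postParse) {
            // Parsing already converted the content to UTF-8.
            return StandardCharsets.UTF_8;
        }
        if (configured != null && !configured.trim().isEmpty()) {
            // An explicitly configured encoding disables detection.
            return Charset.forName(configured.trim());
        }
        return detected;
    }

    public static void main(String[] args) {
        Charset detected = StandardCharsets.ISO_8859_1;
        System.out.println(resolve(null, detected, false));     // ISO-8859-1
        System.out.println(resolve("UTF-16", detected, false)); // UTF-16
        System.out.println(resolve("UTF-16", detected, true));  // UTF-8
    }
}
```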

    You can specify which parser to use when reading documents. The default is "html", which normalizes the content as HTML. This is generally the desired behavior, but it can sometimes cause your selectors to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt to normalize the content. The drawback of "xml" is that some HTML-specific selector options may not work. If you know you are dealing with XML to begin with, specifying "xml" should be a good option.

    Matching links

    You can define as many jsoup "selectors" as desired. Every value matched by a selector is extracted as a URL.

    The "extract" argument accepted by every selector controls exactly what is extracted from a matching element. Values used on this page include "text" (the text of the element) and "attr(name)" (the value of the named attribute, e.g. attr(href)).

    When not specified, the default is "text".

    The default selectors / extract strategies are:

    • a[href] / attr(href)
    • [src] / attr(src)
    • link[href] / attr(href)
    • meta[http-equiv='refresh'] / attr(content)

    For any extracted link value, this extractor applies minimal heuristics to remove extra content that is not part of a regular URL. For instance, it only keeps what comes after url= when dealing with <meta http-equiv="refresh" ...> URLs. It also trims whitespace.
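
    The cleanup described above could look roughly like the following. This is a minimal sketch, not the extractor's actual implementation: the LinkValueCleaner class and its methods are hypothetical, and matching "url=" case-insensitively with optional surrounding spaces is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkValueCleaner {

    // Assumption: the target follows a case-insensitive "url=" marker.
    private static final Pattern REFRESH_URL =
            Pattern.compile("url\\s*=\\s*(.*)$", Pattern.CASE_INSENSITIVE);

    /** Trims surrounding whitespace from an extracted link value. */
    public static String trimLink(String value) {
        return value == null ? "" : value.trim();
    }

    /** Keeps only what follows "url=" in a meta refresh "content" value. */
    public static String cleanMetaRefresh(String content) {
        String value = trimLink(content);
        Matcher m = REFRESH_URL.matcher(value);
        // Without a "url=" marker, fall back to the trimmed value.
        return m.find() ? m.group(1).trim() : value;
    }

    public static void main(String[] args) {
        // Prints: http://example.com/page
        System.out.println(cleanMetaRefresh("5; URL=http://example.com/page "));
        // Prints: /relative/path.html
        System.out.println(trimLink("  /relative/path.html  "));
    }
}
```

    The "url=" rule is kept in its own method so that it only applies to meta refresh values; applying it to every link would truncate ordinary URLs that happen to carry a url= query parameter.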

    Ignoring link data

    By default, contextual information is kept about the HTML/XML mark-up tag from which a link is extracted (e.g., tag name and attributes). That information gets stored as metadata in the target document. If you want to limit the quantity of information extracted/stored, you can disable this feature by setting ignoreLinkData to true.

    URL Schemes

    For absolute URLs, only those with a supported scheme are extracted. By default, the supported schemes are http, https, and ftp. You can specify your own list of supported schemes with setSchemes(String...).
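
    As an illustration of the scheme rule above, the following sketch checks a candidate URL against a list of supported schemes. The SchemeFilter class and isAccepted method are hypothetical. Letting relative URLs through (they carry no scheme to check), comparing schemes case-insensitively, and rejecting malformed values are assumptions of this sketch rather than documented behavior.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class SchemeFilter {

    // Defaults as listed in the documentation above.
    public static final List<String> DEFAULT_SCHEMES =
            Arrays.asList("http", "https", "ftp");

    // Hypothetical check: absolute URLs must use a supported scheme.
    public static boolean isAccepted(String url, List<String> schemes) {
        if (url == null) {
            return false;
        }
        try {
            URI uri = new URI(url.trim());
            if (!uri.isAbsolute()) {
                // Relative URL: no scheme to validate (assumption).
                return true;
            }
            return schemes.contains(
                    uri.getScheme().toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Malformed values are rejected in this sketch (assumption).
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAccepted("https://example.com/", DEFAULT_SCHEMES));
        System.out.println(isAccepted("mailto:someone@example.com", DEFAULT_SCHEMES));
        System.out.println(isAccepted("page.html", DEFAULT_SCHEMES));
    }
}
```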

    Applicable documents

    By default, this extractor is only applied to documents matching one of these content types:

    "nofollow"

    By default, a regular HTML link having the "rel" attribute set to "nofollow" won't be extracted (e.g. <a href="x.html" rel="nofollow" ...>). To force its extraction (and ensure it is followed), call setIgnoreNofollow(true).
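
    The "nofollow" decision described above can be sketched as follows. The NofollowCheck class and its methods are hypothetical and do not reproduce this class's code. Treating "rel" as a whitespace-separated, case-insensitive token list (so that rel="external nofollow" also counts) is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.Locale;

class NofollowCheck {

    // True if the "rel" attribute value contains a "nofollow" token.
    public static boolean hasNofollow(String rel) {
        if (rel == null) {
            return false;
        }
        return Arrays.stream(
                rel.trim().toLowerCase(Locale.ROOT).split("\\s+"))
                .anyMatch("nofollow"::equals);
    }

    // A link is extracted unless it is "nofollow" and that is honored.
    public static boolean shouldExtract(String rel, boolean ignoreNofollow) {
        return ignoreNofollow || !hasNofollow(rel);
    }

    public static void main(String[] args) {
        System.out.println(shouldExtract("nofollow", false)); // false
        System.out.println(shouldExtract("nofollow", true));  // true
        System.out.println(shouldExtract(null, false));       // true
    }
}
```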

    XML configuration usage:

    
    <extractor
        class="com.norconex.collector.http.link.impl.DOMLinkExtractor"
        ignoreNofollow="[false|true]"
        ignoreLinkData="[false|true]"
        parser="[html|xml]"
        charset="(supported character encoding)">
      <fieldMatcher>
        (optional expression for fields used for links extraction instead
         of the document stream)
      </fieldMatcher>
      <schemes>
        (CSV list of URI schemes for which to perform link extraction.
         Leave blank or remove the tag to use defaults.)
      </schemes>
      <!-- Repeat as needed: -->
      <linkSelector>(selector syntax)</linkSelector>
      <!--
        Optional. Only apply link selectors to portions of a document
        matching these selectors. Repeat as needed.
      -->
      <extractSelector>(selector syntax)</extractSelector>
      <!--
        Optional. Do not apply link selectors to portions of a document
        matching these selectors. Repeat as needed.
      -->
      <noExtractSelector>(selector syntax)</noExtractSelector>
    </extractor>

    XML usage example:

    
    <extractor
        class="com.norconex.collector.http.link.impl.DOMLinkExtractor">
      <linkSelector
          extract="attr(href)">
        a[href]
      </linkSelector>
      <linkSelector
          extract="attr(src)">
        [src]
      </linkSelector>
      <linkSelector
          extract="attr(href)">
        link[href]
      </linkSelector>
      <linkSelector
          extract="attr(content)">
        meta[http-equiv='refresh']
      </linkSelector>
      <linkSelector
          extract="attr(data-myurl)">
        [data-myurl]
      </linkSelector>
    </extractor>

    The above example re-declares the default link selectors and adds one that extracts URLs found in custom element attributes named data-myurl.
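
    To make the effect of the [data-myurl] / attr(data-myurl) pair more concrete, here is a rough, standalone sketch of the same idea. It is not how this class works internally: this class relies on jsoup and its selector syntax, whereas the sketch swaps in the JDK's built-in DOM parser and an XPath expression so that it runs without external libraries. The DataAttrLinks class and extract method are hypothetical, and the attribute name is assumed to be a simple XML name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DataAttrLinks {

    // Collects the trimmed values of every attribute named attrName.
    public static List<String> extract(String xml, String attrName) {
        try {
            // The whole text is loaded into an in-memory DOM tree.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // XPath stand-in for "[attrName]" with extract "attr(attrName)".
            NodeList attrs = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//@" + attrName, doc, XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                String value = attrs.item(i).getNodeValue().trim();
                if (!value.isEmpty()) {
                    links.add(value);
                }
            }
            return links;
        } catch (Exception e) {
            throw new IllegalStateException("Could not extract links", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root>"
                + "<item data-myurl='http://example.com/a'/>"
                + "<item data-myurl=' /b '/>"
                + "<item other='x'/>"
                + "</root>";
        System.out.println(extract(xml, "data-myurl"));
    }
}
```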

    Since:
    3.0.0
    Author:
    Pascal Essiembre
    • Constructor Detail

      • DOMLinkExtractor

        public DOMLinkExtractor()
    • Method Detail

      • getCharset

        public String getCharset()
        Gets the assumed source character encoding.
        Returns:
        character encoding of the source content
      • setCharset

        public void setCharset(String charset)
        Sets the assumed source character encoding.
        Parameters:
        charset - character encoding of the source content
      • getParser

        public String getParser()
        Gets the parser to use when creating the DOM-tree.
        Returns:
        html (default) or xml.
      • setParser

        public void setParser(String parser)
        Sets the parser to use when creating the DOM-tree.
        Parameters:
        parser - html or xml.
      • isIgnoreNofollow

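        public boolean isIgnoreNofollow()
        Gets whether to ignore "nofollow" so that links marked rel="nofollow" are extracted.
        Returns:
        true to ignore "nofollow".
      • setIgnoreNofollow

        public void setIgnoreNofollow(boolean ignoreNofollow)
        Sets whether to ignore "nofollow" so that links marked rel="nofollow" are extracted.
        Parameters:
        ignoreNofollow - true to ignore "nofollow".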
      • isIgnoreLinkData

        public boolean isIgnoreLinkData()
        Gets whether to ignore extra data associated with a link.
        Returns:
        true to ignore.
      • setIgnoreLinkData

        public void setIgnoreLinkData(boolean ignoreLinkData)
        Sets whether to ignore extra data associated with a link.
        Parameters:
        ignoreLinkData - true to ignore.
      • addLinkSelector

        public void addLinkSelector(String selector)
        Adds a new link selector extracting the "text" from matches.
        Parameters:
        selector - jsoup selector
      • addLinkSelector

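        public void addLinkSelector(String selector,
                                    String extract)
        Adds a new link selector extracting the specified value from matches.
        Parameters:
        selector - jsoup selector
        extract - what to extract from matches (e.g., "text" or "attr(href)")
      • removeLinkSelector

        public void removeLinkSelector(String selector)
        Removes a link selector.
        Parameters:
        selector - jsoup selector to remove
      • clearLinkSelectors

        public void clearLinkSelectors()
        Removes all link selectors.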
      • getExtractSelectors

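        public List<String> getExtractSelectors()
        Gets the selectors matching the portions of a document to which link selectors are applied.
        Returns:
        selectors
      • setExtractSelectors

        public void setExtractSelectors(List<String> selectors)
        Sets the selectors matching the portions of a document to which link selectors are applied.
        Parameters:
        selectors - selectors
      • setExtractSelectors

        public void setExtractSelectors(String... selectors)
        Sets the selectors matching the portions of a document to which link selectors are applied.
        Parameters:
        selectors - selectors
      • addExtractSelectors

        public void addExtractSelectors(List<String> selectors)
        Adds selectors matching the portions of a document to which link selectors are applied.
        Parameters:
        selectors - selectors
      • addExtractSelectors

        public void addExtractSelectors(String... selectors)
        Adds selectors matching the portions of a document to which link selectors are applied.
        Parameters:
        selectors - selectors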
      • getNoExtractSelectors

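        public List<String> getNoExtractSelectors()
        Gets the selectors matching the portions of a document to which link selectors are not applied.
        Returns:
        selectors
      • setNoExtractSelectors

        public void setNoExtractSelectors(List<String> selectors)
        Sets the selectors matching the portions of a document to which link selectors are not applied.
        Parameters:
        selectors - selectors
      • setNoExtractSelectors

        public void setNoExtractSelectors(String... selectors)
        Sets the selectors matching the portions of a document to which link selectors are not applied.
        Parameters:
        selectors - selectors
      • addNoExtractSelectors

        public void addNoExtractSelectors(List<String> selectors)
        Adds selectors matching the portions of a document to which link selectors are not applied.
        Parameters:
        selectors - selectors
      • addNoExtractSelectors

        public void addNoExtractSelectors(String... selectors)
        Adds selectors matching the portions of a document to which link selectors are not applied.
        Parameters:
        selectors - selectors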
      • getSchemes

        public List<String> getSchemes()
        Gets the schemes to be extracted.
        Returns:
        schemes to be extracted
      • setSchemes

        public void setSchemes(String... schemes)
        Sets the schemes to be extracted.
        Parameters:
        schemes - schemes to be extracted
      • setSchemes

        public void setSchemes(List<String> schemes)
        Sets the schemes to be extracted.
        Parameters:
        schemes - schemes to be extracted