Class DOMFilter

  • All Implemented Interfaces:
    IXMLConfigurable, IDocumentFilter, IOnMatchFilter, IImporterHandler

    public class DOMFilter
    extends AbstractDocumentFilter

    Uses a Document Object Model (DOM) representation of an HTML, XHTML, or XML document content to perform filtering based on matching an element/attribute or element/attribute value.

    To construct a DOM tree, the text is loaded entirely into memory. The text is taken from the document content by default, but it can also come from specified metadata fields. If multiple field values are identified as DOM sources, only one needs to match for the filter to apply. Use this filter with caution if you expect to parse very large files; consider TextFilter instead if memory is a concern.

    The jsoup parser library is used to load document content into a DOM tree. Elements are referenced using a CSS or jQuery-like selector syntax.

    If an element is referenced without a value to match, its mere presence constitutes a match. If both an element and a regular expression are provided, the element value is retrieved and the regular expression is applied against it to determine a match.
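
    The presence-versus-value rule above can be sketched in plain Java. This is only an illustration under stated assumptions, not the library's implementation: DomMatchSketch and isMatch are hypothetical names, the list stands in for the values already extracted from the elements the selector found, and the regular expression is assumed to be applied as a partial (find-style) match.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper illustrating the match rule; not part of the library. */
final class DomMatchSketch {

    /**
     * @param selectedValues values extracted from the elements the selector
     *                       found (empty when the selector found nothing)
     * @param valueRegex     optional regular expression, or null when only
     *                       the presence of an element matters
     */
    static boolean isMatch(List<String> selectedValues, String valueRegex) {
        if (selectedValues.isEmpty()) {
            return false; // the selector found no element: no match
        }
        if (valueRegex == null) {
            return true; // an element is present and there is no value to test
        }
        Pattern pattern = Pattern.compile(valueRegex);
        // Assumption: one element whose value satisfies the expression is enough.
        return selectedValues.stream()
                .anyMatch(value -> pattern.matcher(value).find());
    }
}
```

    For example, isMatch(Arrays.asList("Please skip me now"), "\\bskip me\\b") is true, while the same expression against "skip meeting" is false because no word boundary follows "me".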

    Refer to AbstractDocumentFilter for the inclusion/exclusion logic.

    Should be used as a pre-parse handler.

    Content-types

    By default, this filter applies only to documents matching the restrictions returned by CommonRestrictions.domContentTypes(String). You can specify your own content types if you know they represent files with HTML or XML-like markup tags. For incompatible documents, consider using RegexContentFilter instead.

    When used as a pre-parse handler, this class attempts to detect the content character encoding unless the character encoding was specified using setSourceCharset(String). Since document parsing converts content to UTF-8, UTF-8 is always assumed when used as a post-parse handler.
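
    The order in which the character encoding is chosen can be summarized with a small sketch. The names SourceCharsetSketch and resolve, and the detector supplier standing in for the actual detection step, are hypothetical; the snippet only mirrors the rules described above and is not the class's real code. Treating a blank value the same as an unset one is also an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

/** Hypothetical illustration of the encoding rules; not part of the library. */
final class SourceCharsetSketch {

    /**
     * @param preParse      true when running as a pre-parse handler
     * @param sourceCharset explicitly configured encoding, or null if unset
     * @param detector      stand-in for character encoding detection
     */
    static Charset resolve(
            boolean preParse, String sourceCharset, Supplier<Charset> detector) {
        if (!preParse) {
            // Parsing has already converted the content to UTF-8.
            return StandardCharsets.UTF_8;
        }
        if (sourceCharset != null && !sourceCharset.trim().isEmpty()) {
            // An explicit encoding takes precedence over detection.
            return Charset.forName(sourceCharset.trim());
        }
        // Nothing was specified: fall back to detection.
        return detector.get();
    }
}
```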

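    You can control exactly what gets extracted for matching purposes with the "extract" argument of setExtract(String). Possible values, as listed in the XML configuration usage below, are: text (default), html, outerHtml, ownText, data, tagName, val, className, cssSelector, and attr(attributeKey).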
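
    As a rough illustration of how an "extract" value such as attr(attributeKey) could be interpreted, the sketch below splits the argument into an extraction kind and an optional attribute key. ExtractSpecSketch, kind, and attributeKey are hypothetical names, and treating a blank value like null (falling back to "text") is an assumption; the library may parse the argument differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the "extract" argument; not part of the library. */
final class ExtractSpecSketch {

    // Matches the attr(attributeKey) form and captures the key.
    private static final Pattern ATTR = Pattern.compile("attr\\((.+)\\)");

    /** Returns the extraction kind, e.g. "text", "outerHtml", or "attr". */
    static String kind(String extract) {
        if (extract == null || extract.trim().isEmpty()) {
            return "text"; // default when nothing is specified
        }
        String value = extract.trim();
        return ATTR.matcher(value).matches() ? "attr" : value;
    }

    /** Returns the attribute key of attr(...), or null for any other value. */
    static String attributeKey(String extract) {
        if (extract == null) {
            return null;
        }
        Matcher matcher = ATTR.matcher(extract.trim());
        return matcher.matches() ? matcher.group(1).trim() : null;
    }
}
```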

    You can specify which parser to use when reading documents. The default is "html", which normalizes the content as HTML. This is generally the desired behavior, but it can sometimes cause your selector to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt to normalize the content. The drawback of "xml" is that some HTML-specific selector options may not work. If you know you are dealing with XML to begin with, specifying "xml" should be a good option.

    XML configuration usage:

    
    <handler
        class="com.norconex.importer.handler.filter.impl.DOMFilter"
        onMatch="[include|exclude]"
        sourceCharset="(character encoding)"
        selector="(selector syntax)"
        parser="[html|xml]"
        extract="[text|html|outerHtml|ownText|data|tagName|val|className|cssSelector|attr(attributeKey)]">
      <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
      <restrictTo>
        <fieldMatcher>(field-matching expression)</fieldMatcher>
        <valueMatcher>(value-matching expression)</valueMatcher>
      </restrictTo>
      <fieldMatcher>
        (optional expression matching fields where the DOM text is located)
      </fieldMatcher>
      <valueMatcher>
        (optional expression matching selector extracted value)
      </valueMatcher>
    </handler>

    XML usage example:

    
    <!-- Exclude an HTML page that has one or more GIF images in it: -->
    <handler
        class="DOMFilter"
        selector="img[src$=.gif]"
        onMatch="exclude"/>
    <!--
      Exclude an HTML page that has a paragraph tag with a class called
      "disclaimer" and a value containing "skip me":
      -->
    <handler
        class="DOMFilter"
        selector="p.disclaimer"
        onMatch="exclude">
      <valueMatcher
          method="regex">
        \bskip me\b
      </valueMatcher>
    </handler>
    Since:
    3.0.0
    Author:
    Pascal Essiembre
    • Constructor Detail

      • DOMFilter

        public DOMFilter()
    • Method Detail

      • getSelector

        public String getSelector()
      • setSelector

        public void setSelector(String selector)
      • getFieldMatcher

        public TextMatcher getFieldMatcher()
        Gets a copy of this filter's field matcher.
        Returns:
        field matcher
      • setFieldMatcher

        public void setFieldMatcher(TextMatcher fieldMatcher)
        Sets this filter's field matcher (the supplied matcher is copied).
        Parameters:
        fieldMatcher - field matcher
      • getValueMatcher

        public TextMatcher getValueMatcher()
        Gets a copy of this filter's value matcher.
        Returns:
        value matcher
      • setValueMatcher

        public void setValueMatcher(TextMatcher valueMatcher)
        Sets this filter's value matcher (the supplied matcher is copied).
        Parameters:
        valueMatcher - value matcher
      • getExtract

        public String getExtract()
        Gets what should be extracted for the value, such as "text" (default), "html", or "outerHtml"; the "extract" attribute in the XML configuration usage lists all supported values. null means this class will use the default ("text").
        Returns:
        what should be extracted for the value
      • setExtract

        public void setExtract(String extract)
        Sets what should be extracted for the value, such as "text" (default), "html", or "outerHtml"; the "extract" attribute in the XML configuration usage lists all supported values. null means this class will use the default ("text").
        Parameters:
        extract - what should be extracted for the value
      • getSourceCharset

        public String getSourceCharset()
        Gets the assumed source character encoding.
        Returns:
        character encoding of the source content
      • setSourceCharset

        public void setSourceCharset(String sourceCharset)
        Sets the assumed source character encoding.
        Parameters:
        sourceCharset - character encoding of the source content
      • getParser

        public String getParser()
        Gets the parser to use when creating the DOM tree.
        Returns:
        html (default) or xml.
      • setParser

        public void setParser(String parser)
        Sets the parser to use when creating the DOM tree.
        Parameters:
        parser - html or xml.