Class DOMSplitter

  • All Implemented Interfaces:
    IXMLConfigurable, IImporterHandler, IDocumentSplitter

    public class DOMSplitter
    extends AbstractDocumentSplitter
    implements IXMLConfigurable

    Splits an HTML, XHTML, or XML document on elements matching a given selector.

    This class constructs a DOM tree from the document content. That DOM tree is loaded entirely into memory. Use this splitter with caution if you know you'll need to parse huge files. It may be preferable to use a stream-based approach if this is a concern (e.g., XMLStreamSplitter).
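
    To illustrate the stream-based alternative mentioned above, here is a minimal sketch using the JDK's StAX API (javax.xml.stream). It is not how XMLStreamSplitter is implemented. The class name StreamSplitSketch and the method textOf are hypothetical, and the sketch assumes the target elements contain text only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class; not part of the Norconex Importer API.
public class StreamSplitSketch {

    // Reads the XML as a stream of events and collects the text of every
    // element named elementName. No DOM tree is built, so only the text of
    // the element currently being read is held in memory at a time.
    // Assumption: matching elements contain text only (no child elements),
    // otherwise getElementText() throws an XMLStreamException.
    public static List<String> textOf(String xml, String elementName)
            throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> texts = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    // Consumes events up to the matching end tag.
                    texts.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<contacts>"
                + "<contact>Alice</contact>"
                + "<contact>Bob</contact>"
                + "</contacts>";
        for (String text : textOf(xml, "contact")) {
            System.out.println(text);
        }
    }
}
```

    The trade-off is that a streaming reader only sees one event at a time, so selector-style matching on ancestors or siblings is not available as it is with a DOM tree.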

    The jsoup parser library is used to load the document content into a DOM tree. Elements are referenced using a CSS or jQuery-like selector syntax.

    Should be used as a pre-parse handler.

    Content-types

    By default, this splitter applies only to documents matching the restrictions returned by CommonRestrictions.domContentTypes(String). You can specify your own content types if you know they represent files with HTML or XML-like markup tags.

    Since 2.5.0, when used as a pre-parse handler, this class attempts to detect the content character encoding unless the character encoding was specified using setSourceCharset(String). Since document parsing converts content to UTF-8, UTF-8 is always assumed when used as a post-parse handler.

    Since 2.8.0, you can specify which parser to use when reading documents. The default is "html", which normalizes the content as HTML. This is generally the desired behavior, but it can sometimes cause your selector to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt to normalize the content. The drawback of "xml" is that some HTML-specific selector options may not work. If you know you are dealing with XML to begin with, specifying "xml" should be a good option.

    XML configuration usage:

    
    <handler
        class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
        selector="(selector syntax)"
        parser="[html|xml]"
        sourceCharset="(character encoding)">
      <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
      <restrictTo>
        <fieldMatcher>(field-matching expression)</fieldMatcher>
        <valueMatcher>(value-matching expression)</valueMatcher>
      </restrictTo>
    </handler>

    XML usage example:

    
    <handler
        class="DOMSplitter"
        selector="div.contact"/>

    The above example splits contacts found in an HTML document, each one being stored within a div with a class named "contact".
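
    The effect of the example above can be approximated with the following sketch. It is an illustration only, not this class's implementation: it uses the JDK's built-in XML parser rather than jsoup, so it assumes well-formed markup, and it hard-codes the equivalent of the "div.contact" selector. The class name ContactSplitSketch and the method splitContacts are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of the Norconex Importer API.
public class ContactSplitSketch {

    // Returns one serialized fragment per <div> element whose class
    // attribute contains the token "contact".
    public static List<String> splitContacts(String xhtml) throws Exception {
        // The whole document is loaded into memory as a DOM tree.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // Identity transformer used to turn each matching element back
        // into markup.
        Transformer serializer =
                TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        List<String> parts = new ArrayList<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            // The class attribute is a whitespace-separated token list.
            List<String> classes = Arrays.asList(
                    div.getAttribute("class").trim().split("\\s+"));
            if (classes.contains("contact")) {
                StringWriter out = new StringWriter();
                serializer.transform(
                        new DOMSource(div), new StreamResult(out));
                parts.add(out.toString());
            }
        }
        return parts;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div class=\"contact\">Alice</div>"
                + "<div class=\"note\">ignored</div>"
                + "<div class=\"contact vip\">Bob</div>"
                + "</body></html>";
        // Prints one fragment per matching contact element.
        for (String part : splitContacts(xhtml)) {
            System.out.println(part);
        }
    }
}
```

    Each returned fragment stands in for one child document produced by the split. Real-world HTML is often not well-formed, which is one reason a lenient parser such as jsoup is used by this class instead.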

    Since:
    2.4.0
    Author:
    Pascal Essiembre
    See Also:
    XMLStreamSplitter