Class DOMSplitter
- java.lang.Object
  - com.norconex.importer.handler.AbstractImporterHandler
    - com.norconex.importer.handler.splitter.AbstractDocumentSplitter
      - com.norconex.importer.handler.splitter.impl.DOMSplitter
- All Implemented Interfaces:
- IXMLConfigurable, IImporterHandler, IDocumentSplitter
public class DOMSplitter extends AbstractDocumentSplitter implements IXMLConfigurable
Splits an HTML, XHTML, or XML document on elements matching a given selector.
This class constructs a DOM tree from the document content. That DOM tree is loaded entirely into memory. Use this splitter with caution if you expect to parse very large files; a stream-based approach (e.g., XMLStreamSplitter) may be preferable in that case.
The jsoup parser library is used to load the document content into a DOM tree. Elements are referenced using a CSS or jQuery-like syntax.
Should be used as a pre-parse handler.
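The idea of splitting on matching elements after loading a whole DOM tree can be sketched with the JDK's built-in XML DOM parser. This is only an illustration under stated assumptions: it is not the DOMSplitter implementation, it uses javax.xml in place of jsoup, and it matches on a tag name instead of a CSS selector. The class name DomSplitSketch and the method splitOnTag are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration; not part of the Norconex Importer API.
public class DomSplitSketch {

    /** Returns one serialized XML fragment per element named tagName. */
    public static List<String> splitOnTag(String xml, String tagName) {
        try {
            // The entire document is parsed into an in-memory tree here,
            // which is why very large inputs can be a concern.
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Identity transformer used only to serialize each matched node.
            Transformer serializer =
                    TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(
                    OutputKeys.OMIT_XML_DECLARATION, "yes");

            List<String> children = new ArrayList<>();
            NodeList matches = dom.getElementsByTagName(tagName);
            for (int i = 0; i < matches.getLength(); i++) {
                StringWriter out = new StringWriter();
                serializer.transform(
                        new DOMSource(matches.item(i)), new StreamResult(out));
                children.add(out.toString());
            }
            return children;
        } catch (Exception e) {
            throw new IllegalStateException("Could not split document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<contacts><contact><name>Ann</name></contact>"
                + "<contact><name>Bob</name></contact></contacts>";
        // Each matched element becomes its own child "document".
        for (String child : splitOnTag(xml, "contact")) {
            System.out.println(child);
        }
    }
}
```

Because every matched element is serialized from the same in-memory tree, memory use grows with the size of the source document, which is the trade-off the description warns about.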
Content-types
By default, this splitter is restricted to (applies only to) documents matching the restrictions returned by CommonRestrictions.domContentTypes(String). You can specify your own content types if you know they represent a file with HTML or XML-like markup tags.
Since 2.5.0, when used as a pre-parse handler, this class attempts to detect the content character encoding unless the character encoding was specified using setSourceCharset(String). Since document parsing converts content to UTF-8, UTF-8 is always assumed when used as a post-parse handler.
Since 2.8.0, you can specify which parser to use when reading documents. The default is "html", which normalizes the content as HTML. This is generally the desired behavior, but it can sometimes cause your selector to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt normalization on the content. The drawback with "xml" is that not all HTML-specific selector options may work. If you know you are dealing with XML to begin with, specifying "xml" should be a good option.
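Why the source character encoding matters can be shown with plain JDK classes. The sketch below is an assumption-laden illustration, not the handler's own logic: the class CharsetSketch and the method decode are hypothetical, and the fallback to UTF-8 only mirrors the post-parse assumption described above (the pre-parse charset detection is not reproduced).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of the Norconex Importer API.
public class CharsetSketch {

    /** Decodes raw bytes with the given charset, or UTF-8 when blank. */
    public static String decode(byte[] raw, String sourceCharset) {
        Charset charset =
                (sourceCharset == null || sourceCharset.trim().isEmpty())
                        ? StandardCharsets.UTF_8
                        : Charset.forName(sourceCharset);
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the é is the single byte 0xE9.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Correct charset supplied: the text round-trips.
        System.out.println(decode(raw, "ISO-8859-1"));

        // Wrong assumption (UTF-8): 0xE9 is malformed and gets replaced.
        System.out.println(decode(raw, null));
    }
}
```

Decoding with the wrong encoding corrupts text before any selector is applied, so setting the source charset explicitly is worthwhile when the encoding is known in advance.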
XML configuration usage:
<handler class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
    selector="(selector syntax)"
    parser="[html|xml]"
    sourceCharset="(character encoding)">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher>(field-matching expression)</fieldMatcher>
    <valueMatcher>(value-matching expression)</valueMatcher>
  </restrictTo>
</handler>
XML usage example:
<handler class="DOMSplitter" selector="div.contact"/>
The above example splits contacts found in an HTML document, each one being stored within a div with a class named "contact".
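To make the meaning of the "div.contact" selector concrete, here is a minimal sketch that finds the same elements in well-formed XHTML using the JDK's XPath support. It is an approximation under stated assumptions: DOMSplitter relies on jsoup selectors, not XPath, and the class ContactSplitSketch, the method contactTexts, and the XPath expression are hypothetical stand-ins chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration; not part of the Norconex Importer API.
public class ContactSplitSketch {

    // XPath approximation of the CSS selector "div.contact": a div whose
    // space-separated class list contains the token "contact".
    private static final String DIV_CONTACT =
            "//div[contains(concat(' ', normalize-space(@class), ' '),"
            + " ' contact ')]";

    /** Returns the trimmed text of each matching div, in document order. */
    public static List<String> contactTexts(String xhtml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList matches = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(DIV_CONTACT, dom, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < matches.getLength(); i++) {
                texts.add(matches.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate selector", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div class=\"contact\">Ann</div>"
                + "<div class=\"note\">Not a contact</div>"
                + "<div class=\"contact vip\">Bob</div>"
                + "</body></html>";
        // Each entry corresponds to one child document the split would yield.
        System.out.println(contactTexts(page));
    }
}
```

Unlike the "html" parser described earlier, this JDK parser rejects markup that is not well-formed, so the sketch is closer in spirit to the "xml" parser option.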
- Since:
- 2.4.0
- Author:
- Pascal Essiembre
- See Also:
XMLStreamSplitter
-
-
Constructor Summary
Constructor and Description:
DOMSplitter()
-
Method Summary
Modifier and Type, Method, and Description:
boolean equals(Object other)
String getParser()
  Gets the parser to use when creating the DOM-tree.
String getSelector()
String getSourceCharset()
  Gets the assumed source character encoding.
int hashCode()
protected void loadHandlerFromXML(XML xml)
  Loads configuration settings specific to the implementing class.
protected void saveHandlerToXML(XML xml)
  Saves configuration settings specific to the implementing class.
void setParser(String parser)
  Sets the parser to use when creating the DOM-tree.
void setSelector(String selector)
void setSourceCharset(String sourceCharset)
  Sets the assumed source character encoding.
protected List<Doc> splitApplicableDocument(HandlerDoc doc, InputStream input, OutputStream output, ParseState parseState)
String toString()
-
Methods inherited from class com.norconex.importer.handler.splitter.AbstractDocumentSplitter
splitDocument
-
Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
-
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface com.norconex.commons.lang.xml.IXMLConfigurable
loadFromXML, saveToXML
-
-
-
-
Method Detail
-
getSelector
public String getSelector()
-
setSelector
public void setSelector(String selector)
-
getSourceCharset
public String getSourceCharset()
Gets the assumed source character encoding.
- Returns:
- character encoding of the source to be transformed
- Since:
- 2.5.0
-
setSourceCharset
public void setSourceCharset(String sourceCharset)
Sets the assumed source character encoding.
- Parameters:
- sourceCharset - character encoding of the source to be transformed
- Since:
- 2.5.0
-
getParser
public String getParser()
Gets the parser to use when creating the DOM-tree.
- Returns:
- html (default) or xml.
- Since:
- 2.8.0
-
setParser
public void setParser(String parser)
Sets the parser to use when creating the DOM-tree.
- Parameters:
- parser - html or xml.
- Since:
- 2.8.0
-
splitApplicableDocument
protected List<Doc> splitApplicableDocument(HandlerDoc doc, InputStream input, OutputStream output, ParseState parseState) throws ImporterHandlerException
- Specified by:
- splitApplicableDocument in class AbstractDocumentSplitter
- Throws:
- ImporterHandlerException
-
loadHandlerFromXML
protected void loadHandlerFromXML(XML xml)
Description copied from class: AbstractImporterHandler
Loads configuration settings specific to the implementing class.
- Specified by:
- loadHandlerFromXML in class AbstractImporterHandler
- Parameters:
- xml - XML configuration
-
saveHandlerToXML
protected void saveHandlerToXML(XML xml)
Description copied from class: AbstractImporterHandler
Saves configuration settings specific to the implementing class.
- Specified by:
- saveHandlerToXML in class AbstractImporterHandler
- Parameters:
- xml - the XML
-
equals
public boolean equals(Object other)
- Overrides:
- equals in class AbstractImporterHandler
-
hashCode
public int hashCode()
- Overrides:
- hashCode in class AbstractImporterHandler
-
toString
public String toString()
- Overrides:
- toString in class AbstractImporterHandler
-
-