public class DOMSplitter extends AbstractDocumentSplitter implements IXMLConfigurable
Splits an HTML, XHTML, or XML document on a specific element.
This class constructs a DOM tree from the document content. That DOM tree is loaded entirely into memory. Use this splitter with caution if you know you'll need to parse huge files. It may be preferable to use a stream-based approach if this is a concern.
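As an illustration of what a stream-based alternative might look like, here is a minimal sketch using the JDK's StAX API (javax.xml.stream) instead of this class. It is not part of DOMSplitter: the class name StreamingSplitSketch and the method textOfElements are hypothetical, the input is assumed to be well-formed XML, and only the text of the element currently being read is held in memory.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: not part of the Norconex Importer API.
public class StreamingSplitSketch {

    /** Returns the trimmed text of every element named elementName. */
    public static List<String> textOfElements(String xml, String elementName) {
        List<String> parts = new ArrayList<>();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs or external entities in untrusted input.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            StringBuilder current = null; // non-null while inside a match
            int depth = 0;                // nesting depth inside the match
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (current != null) {
                        depth++;
                    } else if (reader.getLocalName().equals(elementName)) {
                        current = new StringBuilder();
                        depth = 1;
                    }
                } else if (event == XMLStreamConstants.CHARACTERS
                        && current != null) {
                    current.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && current != null) {
                    depth--;
                    if (depth == 0) {
                        // The matching element is closed: emit one part.
                        parts.add(current.toString().trim());
                        current = null;
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return parts;
    }
}
```

Unlike a DOM-based split, this approach cannot use CSS selectors, so it only fits cases where matching on an element name is enough.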
The jsoup parser library is used to load the document content into a DOM tree. Elements are referenced using a CSS or jQuery-like syntax.
Should be used as a pre-parse handler.
By default, this splitter is restricted to (applies only to) documents matching the restrictions returned by CommonRestrictions.domContentTypes(). You can specify your own content types if you know they represent a file with HTML or XML-like markup tags.
Since 2.5.0, when used as a pre-parse handler, this class attempts to detect the content character encoding unless the character encoding was specified using setSourceCharset(String). Since document parsing converts content to UTF-8, UTF-8 is always assumed when used as a post-parse handler.
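The encoding rules above can be summarized as a small decision function. This is an illustrative sketch only, not the class's actual implementation: CharsetChoiceSketch, resolve, and the detector supplier are hypothetical names, and how detection itself is performed is not described on this page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

// Hypothetical sketch: not part of the Norconex Importer API.
public class CharsetChoiceSketch {

    /**
     * @param sourceCharset explicitly configured encoding, or null/blank
     * @param parsed        true when running as a post-parse handler
     * @param detector      stand-in for whatever detection is performed
     */
    public static Charset resolve(
            String sourceCharset, boolean parsed, Supplier<Charset> detector) {
        if (parsed) {
            // Parsing already converted the content to UTF-8.
            return StandardCharsets.UTF_8;
        }
        if (sourceCharset != null && !sourceCharset.trim().isEmpty()) {
            // An explicit setting takes precedence over detection.
            return Charset.forName(sourceCharset.trim());
        }
        // Pre-parse with no explicit setting: fall back to detection.
        return detector.get();
    }
}
```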
Since 2.8.0, you can specify which parser to use when reading documents. The default is "html", which normalizes the content as HTML. This is generally the desired behavior, but it can sometimes cause your selector to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt to normalize the content. The drawback of "xml" is that some HTML-specific selector options may not work. If you know you are dealing with XML to begin with, specifying "xml" should be a good option.
<splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
    selector="(selector syntax)"
    parser="[html|xml]"
    sourceCharset="(character encoding)" >
  <restrictTo caseSensitive="[false|true]"
      field="(name of header/metadata field name to match)">
    (regular expression of value to match)
  </restrictTo>
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
</splitter>
The following splits out the contacts found in an HTML document, each one being stored within a div with a class named "contact".
<splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter" selector="div.contact" />
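To make the effect of a "div.contact" split more concrete, the following is a rough sketch using only the JDK's built-in DOM parser. It is not how DOMSplitter is implemented (the class uses jsoup and full CSS selectors): DomSplitSketch and splitOnDivClass are hypothetical names, the input must be well-formed XHTML, the class match is a plain token comparison, and the text of each matching element stands in for a child document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: not part of the Norconex Importer API.
public class DomSplitSketch {

    /** Returns the trimmed text of each div carrying the given class. */
    public static List<String> splitOnDivClass(String xhtml, String cssClass) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The whole document is loaded into memory as a DOM tree.
            Document dom =
                    builder.parse(new InputSource(new StringReader(xhtml)));
            NodeList divs = dom.getElementsByTagName("div");
            List<String> children = new ArrayList<>();
            for (int i = 0; i < divs.getLength(); i++) {
                Element div = (Element) divs.item(i);
                // The class attribute may hold several space-separated names.
                List<String> classes = Arrays.asList(
                        div.getAttribute("class").trim().split("\\s+"));
                if (classes.contains(cssClass)) {
                    children.add(div.getTextContent().trim());
                }
            }
            return children;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
    }
}
```

For example, a page with two div elements of class "contact" and one of class "note" would yield two entries, one per contact.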
Constructor and Description |
---|
DOMSplitter() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
String | getParser(): Gets the parser to use when creating the DOM-tree. |
String | getSelector() |
String | getSourceCharset(): Gets the assumed source character encoding. |
int | hashCode() |
protected void | loadHandlerFromXML(org.apache.commons.configuration.XMLConfiguration xml): Loads configuration settings specific to the implementing class. |
protected void | saveHandlerToXML(EnhancedXMLStreamWriter writer): Saves configuration settings specific to the implementing class. |
void | setParser(String parser): Sets the parser to use when creating the DOM-tree. |
void | setSelector(String selector) |
void | setSourceCharset(String sourceCharset): Sets the assumed source character encoding. |
protected List<ImporterDocument> | splitApplicableDocument(SplittableDocument doc, OutputStream output, CachedStreamFactory streamFactory, boolean parsed) |
String | toString() |
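Methods inherited from class AbstractDocumentSplitter: splitDocument
Methods inherited from class AbstractImporterHandler: addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
Methods inherited from class java.lang.Object: clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Methods inherited from interface IXMLConfigurable: loadFromXML, saveToXML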
public String getSelector()

public void setSelector(String selector)

public String getSourceCharset()

public void setSourceCharset(String sourceCharset)
Parameters: sourceCharset - character encoding of the source to be transformed

public String getParser()
Returns: html (default) or xml.

public void setParser(String parser)
Parameters: parser - html or xml.

protected List<ImporterDocument> splitApplicableDocument(SplittableDocument doc, OutputStream output, CachedStreamFactory streamFactory, boolean parsed) throws ImporterHandlerException
splitApplicableDocument in class AbstractDocumentSplitter
Throws: ImporterHandlerException

protected void loadHandlerFromXML(org.apache.commons.configuration.XMLConfiguration xml)
Description copied from class: AbstractImporterHandler
Loads configuration settings specific to the implementing class.
loadHandlerFromXML in class AbstractImporterHandler
Parameters: xml - xml configuration

protected void saveHandlerToXML(EnhancedXMLStreamWriter writer) throws XMLStreamException
Description copied from class: AbstractImporterHandler
Saves configuration settings specific to the implementing class.
saveHandlerToXML in class AbstractImporterHandler
Parameters: writer - the xml writer
Throws: XMLStreamException - could not save to XML

public boolean equals(Object other)
equals in class AbstractImporterHandler

public int hashCode()
hashCode in class AbstractImporterHandler

public String toString()
toString in class AbstractImporterHandler
Copyright © 2009–2021 Norconex Inc. All rights reserved.