public class DOMTagger extends AbstractDocumentTagger
Extracts the value of one or more elements or attributes into
a target field, from an HTML, XHTML, or XML document. If a target field
already exists, extracted values are added to the existing values,
unless "overwrite" is set to true.
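The difference between adding to a target field and overwriting it can be sketched as follows. This is a minimal illustration only: a plain map stands in for the importer metadata, and the `store` helper and the field name "author" are hypothetical, not part of the DOMTagger API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: a plain multi-valued map stands in for document metadata.
final class OverwriteSketch {

    // Hypothetical helper (not a DOMTagger method): stores extracted values
    // in a multi-valued field, either replacing or appending.
    static void store(Map<String, List<String>> fields, String toField,
            List<String> extracted, boolean overwrite) {
        if (overwrite) {
            // Existing values are discarded.
            fields.put(toField, new ArrayList<>(extracted));
        } else {
            // Existing values are kept; new ones are appended.
            fields.computeIfAbsent(toField, k -> new ArrayList<>())
                    .addAll(extracted);
        }
    }

    // Starts with one existing value, then stores one extracted value.
    static Map<String, List<String>> demo(boolean overwrite) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("author", new ArrayList<>(Arrays.asList("Averell")));
        store(fields, "author", Arrays.asList("Joe"), overwrite);
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(demo(false)); // {author=[Averell, Joe]}
        System.out.println(demo(true));  // {author=[Joe]}
    }
}
```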
This class constructs a DOM tree from the document content. That DOM tree
is loaded entirely into memory. Use this tagger with caution if you know
you'll need to parse huge files. It may be preferable to use TextPatternTagger
if this is a concern. Also, to help performance and avoid re-creating the DOM
tree before every DOM extraction you want to perform, try to combine multiple
extractions in a single instance of this tagger.
The jsoup parser library is used to load document content into a DOM tree. Elements are referenced using a CSS or jQuery-like syntax.
Should be used as a pre-parse handler.
By default, this tagger is restricted to (applies only to) documents matching
the restrictions returned by CommonRestrictions.domContentTypes().
You can specify your own content types if you know they represent a file
with HTML or XML-like markup tags.
Since 2.5.0, when used as a pre-parse handler, this class attempts to detect
the content character encoding unless the character encoding was specified
using setSourceCharset(String). Since document parsing converts content to
UTF-8, UTF-8 is always assumed when used as a post-parse handler.
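One way to read the character encoding rules above is as a three-step decision. The sketch below is an interpretation, not the class's actual code: `resolve` and the `detector` supplier are hypothetical names, and the detector simply stands in for whatever detection the tagger performs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

// Illustrative sketch of the encoding choice described in the text.
final class CharsetChoiceSketch {

    // Hypothetical helper: "parsed" is true for post-parse handlers,
    // "detector" stands in for the tagger's own charset detection.
    static Charset resolve(boolean parsed, String sourceCharset,
            Supplier<Charset> detector) {
        if (parsed) {
            // Parsing already converted the content to UTF-8.
            return StandardCharsets.UTF_8;
        }
        if (sourceCharset != null && !sourceCharset.trim().isEmpty()) {
            // Explicitly configured encoding wins over detection.
            return Charset.forName(sourceCharset.trim());
        }
        // Pre-parse with nothing configured: fall back to detection.
        return detector.get();
    }

    public static void main(String[] args) {
        Supplier<Charset> detector = () -> StandardCharsets.UTF_16;
        System.out.println(resolve(true, "ISO-8859-1", detector));  // UTF-8
        System.out.println(resolve(false, "ISO-8859-1", detector)); // ISO-8859-1
        System.out.println(resolve(false, null, detector));         // UTF-16
    }
}
```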
Since 2.5.0, it is possible to control exactly what gets extracted
with the "extract" argument of the new method
DOMTagger.DOMExtractDetails.setExtract(String). Version 2.6.0
introduced several more extract options. Possible values are:
text, html, outerHtml, ownText, data, tagName, val, className,
cssSelector, and attr(attributeKey).
Since 2.6.0, it is possible to specify a fromField
as the source of the HTML to parse instead of using the document content.
If multiple values are present for that source field, DOM extraction will be
applied to each value.
Since 2.6.0, it is possible to specify a defaultValue
for each set of DOM extraction details. When no match occurs for a given
selector, the default value is stored in the toField (as opposed
to not storing anything). When matching blanks (see below) you will get
an empty string as opposed to the default value.
As of 2.6.1, empty strings and spaces are supported as default values
(the default value is now taken literally).
Since 2.6.1, you can set matchBlanks to true to match elements that are
present but have blank values. Blank values are empty values or values
containing white spaces only. Because white spaces are normalized by the
DOM parser, such matches will always return an empty string (spaces will
be trimmed). By default, elements with blank values are not matched and
are ignored.
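The interaction between matchBlanks and defaultValue described above can be summarized in a small sketch. It reflects one reading of the text, not the actual implementation; `valuesToStore` is a hypothetical helper and `matched` stands for the raw values of the elements a selector found.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch of how blank matches and default values interact.
final class BlankAndDefaultSketch {

    // Hypothetical helper: returns what would end up in the target field.
    static List<String> valuesToStore(List<String> matched,
            boolean matchBlanks, String defaultValue) {
        List<String> out = new ArrayList<>();
        for (String value : matched) {
            if (!value.trim().isEmpty()) {
                out.add(value);      // regular, non-blank match
            } else if (matchBlanks) {
                out.add("");         // blank match is kept as an empty string
            }
            // Otherwise the blank element is ignored, as if it did not match.
        }
        if (out.isEmpty() && defaultValue != null) {
            out.add(defaultValue);   // taken literally, even "" or " "
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> blankOnly = Arrays.asList("  ");
        List<String> none = Collections.emptyList();
        System.out.println(valuesToStore(none, false, "n/a"));      // [n/a]
        System.out.println(valuesToStore(blankOnly, false, "n/a")); // [n/a]
        System.out.println(valuesToStore(blankOnly, true, "n/a"));  // []  (one empty string)
        System.out.println(valuesToStore(none, false, null));       // []  (nothing stored)
    }
}
```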
Since 2.8.0, you can specify which parser to use when reading documents. The default is "html", which normalizes the content as HTML. This is generally the desired behavior, but it can sometimes cause your selector to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt normalization on the content. The drawback with "xml" is that not all HTML-specific selector options may work. If you know you are dealing with XML to begin with, specifying "xml" should be a good option.
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger"
    fromField="(optional source field)"
    parser="[html|xml]"
    sourceCharset="(character encoding)">
  <restrictTo caseSensitive="[false|true]"
      field="(name of metadata field name to match)">
    (regular expression of value to match)
  </restrictTo>
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <dom selector="(selector syntax)"
      toField="(target field)"
      overwrite="[false|true]"
      extract="[text|html|outerHtml|ownText|data|tagName|val|className|cssSelector|attr(attributeKey)]"
      matchBlanks="[false|true]"
      defaultValue="(optional value to use when no match)" />
  <!-- multiple "dom" tags allowed -->
</tagger>
Given this HTML snippet...
<div class="firstName">Joe</div>
<div class="lastName">Dalton</div>
... the following will store "Joe" in a "firstName" field and "Dalton" in a "lastName" field.
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <dom selector="div.firstName" toField="firstName" />
  <dom selector="div.lastName" toField="lastName" />
</tagger>
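To make the example above concrete, and to show why several extractions are best run against a single parsed tree, here is a rough stand-in that uses only the JDK. It is not how DOMTagger works internally: the JDK's XML parser and XPath replace jsoup and its CSS selectors, the `<body>` wrapper is added so the snippet is well-formed XML, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Illustrative sketch: parse once, then run every extraction on the same tree.
final class SingleParseSketch {

    static Map<String, String> extract(String markup) {
        try {
            // The whole document is loaded into memory as a DOM tree, once.
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // XPath equivalents of the CSS selectors div.firstName and
            // div.lastName (exact class match only, for simplicity).
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("firstName",
                    xpath.evaluate("//div[@class='firstName']", dom));
            fields.put("lastName",
                    xpath.evaluate("//div[@class='lastName']", dom));
            return fields;
        } catch (Exception e) {
            throw new IllegalStateException("Could not extract values", e);
        }
    }

    public static void main(String[] args) {
        String markup = "<body>"
                + "<div class=\"firstName\">Joe</div> "
                + "<div class=\"lastName\">Dalton</div>"
                + "</body>";
        System.out.println(extract(markup)); // {firstName=Joe, lastName=Dalton}
    }
}
```

Both lookups reuse the same in-memory tree, which mirrors the advice to declare multiple "dom" entries on one tagger rather than configuring one tagger per selector.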
Modifier and Type | Class and Description |
---|---|
static class | DOMTagger.DOMExtractDetails: DOM Extraction Details |
Constructor and Description |
---|
DOMTagger(): Constructor. |
Modifier and Type | Method and Description |
---|---|
void | addDOMExtractDetails(DOMTagger.DOMExtractDetails extractDetails): Adds DOM extraction details. |
void | addDOMExtractDetails(String selector, String toField, boolean overwrite): Deprecated. Since 2.6.0, use addDOMExtractDetails(DOMExtractDetails) instead. |
void | addDOMExtractDetails(String selector, String toField, boolean overwrite, String extract): Deprecated. Since 2.6.0, use addDOMExtractDetails(DOMExtractDetails) instead. |
boolean | equals(Object other) |
List<DOMTagger.DOMExtractDetails> | getDOMExtractDetailsList(): Gets a list of DOM extraction details. |
String | getFromField(): Gets optional source field holding the HTML content to apply DOM extraction to. |
String | getParser(): Gets the parser to use when creating the DOM-tree. |
String | getSourceCharset(): Gets the assumed source character encoding. |
int | hashCode() |
protected void | loadHandlerFromXML(org.apache.commons.configuration.XMLConfiguration xml): Loads configuration settings specific to the implementing class. |
void | removeDOMExtractDetails(String selector): Removes the DOM extraction details matching the given selector. |
protected void | saveHandlerToXML(EnhancedXMLStreamWriter writer): Saves configuration settings specific to the implementing class. |
void | setFromField(String fromField): Sets optional source field holding the HTML content to apply DOM extraction to. |
void | setParser(String parser): Sets the parser to use when creating the DOM-tree. |
void | setSourceCharset(String sourceCharset): Sets the assumed source character encoding. |
protected void | tagApplicableDocument(String reference, InputStream document, ImporterMetadata metadata, boolean parsed) |
String | toString() |
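Methods inherited from class AbstractDocumentTagger: tagDocument
Methods inherited from class AbstractImporterHandler: addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML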
public String getSourceCharset()
public void setSourceCharset(String sourceCharset)
Parameters: sourceCharset - character encoding of the source to be transformed
public String getFromField()
public void setFromField(String fromField)
Parameters: fromField - from field
public String getParser()
Returns: html (default) or xml.
public void setParser(String parser)
Parameters: parser - html or xml.
protected void tagApplicableDocument(String reference, InputStream document, ImporterMetadata metadata, boolean parsed) throws ImporterHandlerException
Specified by: tagApplicableDocument in class AbstractDocumentTagger
Throws: ImporterHandlerException
public void addDOMExtractDetails(DOMTagger.DOMExtractDetails extractDetails)
Parameters: extractDetails - DOM extraction details
public List<DOMTagger.DOMExtractDetails> getDOMExtractDetailsList()
public void removeDOMExtractDetails(String selector)
Parameters: selector - DOM selector
@Deprecated public void addDOMExtractDetails(String selector, String toField, boolean overwrite)
Deprecated. Since 2.6.0, use addDOMExtractDetails(DOMExtractDetails) instead.
Parameters: selector - selector
toField - target field name
overwrite - whether to overwrite the target field if it exists
@Deprecated public void addDOMExtractDetails(String selector, String toField, boolean overwrite, String extract)
Deprecated. Since 2.6.0, use addDOMExtractDetails(DOMExtractDetails) instead.
Parameters: selector - selector
toField - target field name
overwrite - whether to overwrite the target field if it exists
extract - one of: html, outerHtml, text, ownText, data, tagName, val, className, cssSelector, or attr(attributeKey)
protected void loadHandlerFromXML(org.apache.commons.configuration.XMLConfiguration xml) throws IOException
Description copied from class AbstractImporterHandler: Loads configuration settings specific to the implementing class.
Specified by: loadHandlerFromXML in class AbstractImporterHandler
Parameters: xml - xml configuration
Throws: IOException - could not load from XML
protected void saveHandlerToXML(EnhancedXMLStreamWriter writer) throws XMLStreamException
Description copied from class AbstractImporterHandler: Saves configuration settings specific to the implementing class.
Specified by: saveHandlerToXML in class AbstractImporterHandler
Parameters: writer - the xml writer
Throws: XMLStreamException - could not save to XML
public boolean equals(Object other)
Overrides: equals in class AbstractImporterHandler
public int hashCode()
Overrides: hashCode in class AbstractImporterHandler
public String toString()
Overrides: toString in class AbstractImporterHandler
Copyright © 2009–2021 Norconex Inc.. All rights reserved.