Class DOMTagger
- java.lang.Object
  - com.norconex.importer.handler.AbstractImporterHandler
    - com.norconex.importer.handler.tagger.AbstractDocumentTagger
      - com.norconex.importer.handler.tagger.impl.DOMTagger
All Implemented Interfaces:
IXMLConfigurable, IImporterHandler, IDocumentTagger
public class DOMTagger extends AbstractDocumentTagger
Extracts the value of one or more elements or attributes into a target field, or deletes matching elements. Applies to HTML, XHTML, or XML documents.
This class constructs a DOM tree from a document or field content. That DOM tree is loaded entirely into memory, so use this tagger with caution if you expect to parse huge files. It may be preferable to use RegexTagger if this is a concern. Also, to help performance and avoid re-creating the DOM tree before every DOM extraction, try to combine multiple extractions in a single instance of this tagger.
The jsoup parser library is used to load the document content into a DOM tree. Elements are referenced using a CSS or jQuery-like syntax.
Should be used as a pre-parse handler.
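As a minimal sketch of combining several extractions in one tagger instance, so that the DOM tree is not re-created for each extraction, the following configuration declares three "dom" entries under a single handler. The selectors and target field names are illustrative assumptions, not values taken from the library; the attributes follow the XML configuration usage documented for this class.

```xml
<!-- One handler instance, several extractions sharing one DOM tree. -->
<handler class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <!-- Text of the <title> element ("text" is the default extract). -->
  <dom selector="title" toField="pageTitle"/>
  <!-- Value of the "content" attribute of the description meta tag. -->
  <dom selector="meta[name=description]" toField="pageDescription" extract="attr(content)"/>
  <!-- Inner HTML of first-level headings. -->
  <dom selector="h1" toField="headingHtml" extract="html"/>
</handler>
```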
Storing values in an existing field
If a target field with the same name already exists for a document, values will be added to the end of the existing value list. It is possible to change this default behavior by supplying a PropertySetter.
Content-types
By default, this tagger is restricted to (applies only to) documents matching the restrictions returned by CommonRestrictions.domContentTypes(String). You can specify your own content types if you know they represent a file with HTML or XML-like markup tags.
When used as a pre-parse handler, this class attempts to detect the content character encoding unless the character encoding was specified using setSourceCharset(String). Since document parsing converts content to UTF-8, UTF-8 is always assumed when used as a post-parse handler.
You can control exactly what gets extracted with the "extract" argument of DOMTagger.DOMExtractDetails.setExtract(String). Possible values are:
- text: Default option when extract is blank. The text of the element, including combined children.
- html: Extracts an element's inner HTML (including children).
- outerHtml: Extracts an element's outer HTML (like "html", but includes the "current" tag).
- ownText: Extracts the text owned by this element only; does not get the combined text of all children.
- data: Extracts the combined data of a data-element (e.g. <script>).
- id: Extracts the ID attribute of the element (if any).
- tagName: Extracts the name of the tag of the element.
- val: Extracts the value of a form element (input, textarea, etc).
- className: Extracts the literal value of the element's "class" attribute, which may include multiple class names, space separated.
- cssSelector: Extracts a CSS selector that will uniquely select (identify) this element.
- attr(attributeKey): Extracts the value of the element attribute matching your replacement for "attributeKey" (e.g. "attr(title)" will extract the "title" attribute).
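To make the options above concrete, here is a hedged sketch that applies several "extract" values to the same element. The sample markup in the comment, the selector, and the field names are hypothetical; the comments restate the option definitions above rather than output copied from an actual run.

```xml
<!--
  Hypothetical input:
    <a id="home" class="nav main" href="/index.html" title="Home page">Go <b>home</b></a>
-->
<handler class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <!-- Combined text of the link and its children. -->
  <dom selector="a.nav" toField="linkText" extract="text"/>
  <!-- Text owned by the link only, without the text of <b>. -->
  <dom selector="a.nav" toField="linkOwnText" extract="ownText"/>
  <!-- The link element itself plus its content, as markup. -->
  <dom selector="a.nav" toField="linkOuterHtml" extract="outerHtml"/>
  <!-- Literal "class" attribute value (space-separated class names). -->
  <dom selector="a.nav" toField="linkClasses" extract="className"/>
  <!-- Value of the "title" attribute. -->
  <dom selector="a.nav" toField="linkTitle" extract="attr(title)"/>
</handler>
```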
You can specify a fromField as the source of the HTML to parse instead of using the document content. If multiple values are present for that source field, DOM extraction will be applied to each value.
You can specify a defaultValue on each DOM extraction details. When no match occurs for a given selector, the default value is stored in the toField (as opposed to not storing anything). When matching blanks (see below) you will get an empty string as opposed to the default value. Empty strings and spaces are supported as default values (the default value is taken literally).
You can set matchBlanks to true to match elements that are present but have blank values. Blank values are empty values or values containing white spaces only. Because white spaces are normalized by the DOM parser, such matches will always return an empty string (spaces will be trimmed). By default, elements with blank values are not matched and are ignored.
You can specify which parser to use when reading documents. The default is "html" and will normalize the content as HTML. This is generally the desired behavior, but it can sometimes cause your selector to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt normalization on the content. The drawback with "xml" is that not all HTML-specific selector options may work. If you know you are dealing with XML to begin with, specifying "xml" should be a good option.
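The source field, default value, blank matching, and parser options can be combined as in the following sketch. It assumes a hypothetical metadata field named "rawSummary" that holds an HTML fragment, plus made-up selectors and target fields; only the attribute names come from the documented configuration.

```xml
<!-- Parse the HTML held in a field instead of the document content. -->
<handler class="com.norconex.importer.handler.tagger.impl.DOMTagger"
    fromField="rawSummary">
  <!-- No matching element: "unknown" is stored in "author". -->
  <dom selector="span.author" toField="author" defaultValue="unknown"/>
  <!-- Element present but blank: an empty string is stored in "subtitle". -->
  <dom selector="h2.subtitle" toField="subtitle" matchBlanks="true"/>
</handler>

<!-- Source known to be XML: use the parser that skips HTML normalization. -->
<handler class="com.norconex.importer.handler.tagger.impl.DOMTagger"
    parser="xml" sourceCharset="UTF-8">
  <dom selector="item > title" toField="itemTitle"/>
</handler>
```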
Content deletion from fields
As of 3.0.0, you can specify whether to delete any elements matched by the selector. You can use deletion together with a "toField" or on its own. Some options, such as "extract" or "defaultValue", are ignored by deletions. Because taggers cannot modify the document content, deletion only applies to metadata fields. Use DOMDeleteTransformer to modify the document content.
XML configuration usage:
<handler class="com.norconex.importer.handler.tagger.impl.DOMTagger"
    fromField="(optional source field)"
    parser="[html|xml]"
    sourceCharset="(character encoding)">

  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher>(field-matching expression)</fieldMatcher>
    <valueMatcher>(value-matching expression)</valueMatcher>
  </restrictTo>

  <!-- multiple "dom" tags allowed -->
  <dom selector="(selector syntax)"
      toField="(target field)"
      extract="[text|html|outerHtml|ownText|data|tagName|val|className|cssSelector|attr(attributeKey)]"
      matchBlanks="[false|true]"
      defaultValue="(optional value to use when no match)"
      delete="[false|true]"/>
</handler>
XML usage example:
<handler class="DOMTagger">
  <dom selector="div.firstName" toField="firstName"/>
  <dom selector="div.lastName" toField="lastName"/>
</handler>
Given this HTML snippet...
<div class="firstName">Joe</div>
<div class="lastName">Dalton</div>
... the above example will store "Joe" in a "firstName" field and "Dalton" in a "lastName" field.
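Deletion, described under "Content deletion from fields", can be sketched in the same way. The example assumes a hypothetical "rawSummary" field containing HTML and a made-up selector; since taggers cannot modify the document content, a "fromField" is specified so that deletion has a metadata field to act on.

```xml
<handler class="com.norconex.importer.handler.tagger.impl.DOMTagger"
    fromField="rawSummary">
  <!-- Delete matching elements from the field; "extract" and
       "defaultValue" would be ignored here. -->
  <dom selector="div.advertisement" delete="true"/>
</handler>
```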
- Since: 2.4.0
- Author: Pascal Essiembre
- See Also: DOMDeleteTransformer
Nested Class Summary
- static class DOMTagger.DOMExtractDetails: DOM Extraction Details
Constructor Summary
- DOMTagger(): Constructor.
Method Summary
- void addDOMExtractDetails(DOMTagger.DOMExtractDetails extractDetails): Adds DOM extraction details.
- boolean equals(Object other)
- List<DOMTagger.DOMExtractDetails> getDOMExtractDetailsList(): Gets a list of DOM extraction details.
- String getFromField(): Gets optional source field holding the HTML content to apply DOM extraction to.
- String getParser(): Gets the parser to use when creating the DOM-tree.
- String getSourceCharset(): Gets the assumed source character encoding.
- int hashCode()
- protected void loadHandlerFromXML(XML xml): Loads configuration settings specific to the implementing class.
- void removeDOMExtractDetails(String selector): Removes the DOM extraction details matching the given selector.
- protected void saveHandlerToXML(XML xml): Saves configuration settings specific to the implementing class.
- void setFromField(String fromField): Sets optional source field holding the HTML content to apply DOM extraction to.
- void setParser(String parser): Sets the parser to use when creating the DOM-tree.
- void setSourceCharset(String sourceCharset): Sets the assumed source character encoding.
- void tagApplicableDocument(HandlerDoc doc, InputStream document, ParseState parseState)
- String toString()
Methods inherited from class com.norconex.importer.handler.tagger.AbstractDocumentTagger
tagDocument
Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
Method Detail
getSourceCharset
public String getSourceCharset()
Gets the assumed source character encoding.
- Returns: character encoding of the source to be transformed
- Since: 2.5.0
setSourceCharset
public void setSourceCharset(String sourceCharset)
Sets the assumed source character encoding.
- Parameters: sourceCharset - character encoding of the source to be transformed
- Since: 2.5.0
getFromField
public String getFromField()
Gets optional source field holding the HTML content to apply DOM extraction to.
- Returns: from field
- Since: 2.6.0
setFromField
public void setFromField(String fromField)
Sets optional source field holding the HTML content to apply DOM extraction to.
- Parameters: fromField - from field
- Since: 2.6.0
getParser
public String getParser()
Gets the parser to use when creating the DOM-tree.
- Returns: html (default) or xml.
- Since: 2.8.0
setParser
public void setParser(String parser)
Sets the parser to use when creating the DOM-tree.
- Parameters: parser - html or xml.
- Since: 2.8.0
tagApplicableDocument
public void tagApplicableDocument(HandlerDoc doc, InputStream document, ParseState parseState) throws ImporterHandlerException
- Specified by: tagApplicableDocument in class AbstractDocumentTagger
- Throws: ImporterHandlerException
addDOMExtractDetails
public void addDOMExtractDetails(DOMTagger.DOMExtractDetails extractDetails)
Adds DOM extraction details.
- Parameters: extractDetails - DOM extraction details
- Since: 2.6.0
getDOMExtractDetailsList
public List<DOMTagger.DOMExtractDetails> getDOMExtractDetailsList()
Gets a list of DOM extraction details.
- Returns: list of DOM extraction details.
- Since: 2.6.0
removeDOMExtractDetails
public void removeDOMExtractDetails(String selector)
Removes the DOM extraction details matching the given selector.
- Parameters: selector - DOM selector
- Since: 2.6.0
loadHandlerFromXML
protected void loadHandlerFromXML(XML xml)
Description copied from class: AbstractImporterHandler
Loads configuration settings specific to the implementing class.
- Specified by: loadHandlerFromXML in class AbstractImporterHandler
- Parameters: xml - XML configuration
saveHandlerToXML
protected void saveHandlerToXML(XML xml)
Description copied from class: AbstractImporterHandler
Saves configuration settings specific to the implementing class.
- Specified by: saveHandlerToXML in class AbstractImporterHandler
- Parameters: xml - the XML
equals
public boolean equals(Object other)
- Overrides: equals in class AbstractImporterHandler
hashCode
public int hashCode()
- Overrides: hashCode in class AbstractImporterHandler
toString
public String toString()
- Overrides: toString in class AbstractImporterHandler