public class DOMPreserveTransformer extends AbstractDocumentTransformer
Preserves only the elements matching one or more given selectors, discarding the rest of the document content. Applies to HTML, XHTML, or XML documents.
To store matched values in fields instead, use DOMTagger.
This class constructs a DOM tree from the content of a document or field. That DOM tree is loaded entirely into memory, so use this transformer with caution if you expect to parse very large files.
The jsoup parser library is used to load the document content into a DOM tree. Elements are referenced using a CSS or jQuery-like selector syntax.
Should be used as a pre-parse handler.
By default, this transformer applies only to documents matching the restrictions returned by CommonRestrictions.domContentTypes(String). You can specify your own content types if you know they represent files with HTML or XML-like markup tags.
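As a rough sketch of overriding the default restriction, the configuration below limits the transformer to a custom XML-based content type. The field name `document.contentType`, the content type value, and the `title` selector are assumptions made for illustration only; substitute the field and content types that apply to your documents.

```xml
<handler
    class="com.norconex.importer.handler.transformer.impl.DOMPreserveTransformer"
    parser="xml">
  <!-- Assumption: the content type is held in a field named "document.contentType". -->
  <restrictTo>
    <fieldMatcher method="basic">document.contentType</fieldMatcher>
    <!-- Hypothetical content type known to contain XML-like markup. -->
    <valueMatcher method="basic">application/vnd.example+xml</valueMatcher>
  </restrictTo>
  <dom selector="title" extract="outerHtml"/>
</handler>
```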
When used as a pre-parse handler, this class attempts to detect the content character encoding unless one was specified using setSourceCharset(String). Because document parsing converts content to UTF-8, UTF-8 is always assumed when this class is used as a post-parse handler.
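A minimal sketch of declaring the source encoding explicitly, so that no detection is attempted when the handler runs before parsing. The `ISO-8859-1` value and the `div.article` selector are illustrative assumptions.

```xml
<!-- Assumes this handler is declared as a pre-parse handler. -->
<handler
    class="com.norconex.importer.handler.transformer.impl.DOMPreserveTransformer"
    sourceCharset="ISO-8859-1">
  <!-- Hypothetical selector: keep only the article container. -->
  <dom selector="div.article" extract="outerHtml"/>
</handler>
```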
You can control exactly what gets preserved with the "extract" argument of DOMPreserveTransformer.DOMExtractDetails.setExtract(String). Possible values are: text, html, outerHtml, ownText, data, tagName, val, className, cssSelector, and attr(attributeKey).
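The following sketch shows how different "extract" values might be combined in one handler. The selectors are hypothetical, and the comments describe the presumed effect of each value based on its name rather than a verified result.

```xml
<handler
    class="com.norconex.importer.handler.transformer.impl.DOMPreserveTransformer">
  <!-- Keep the matching element together with its own tag. -->
  <dom selector="div.summary" extract="outerHtml"/>
  <!-- Presumably keeps only the text of the matching element. -->
  <dom selector="h1.title" extract="text"/>
  <!-- "attributeKey" is replaced by the attribute name, here "href". -->
  <dom selector="a.download" extract="attr(href)"/>
</handler>
```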
You can specify a defaultValue on each set of DOM extraction details. When a given selector has no match, the default value is inserted into the modified document content. When matching blanks (see below), a blank match yields an empty string rather than the default value. Empty strings and spaces are supported as default values (the default value is now taken literally).
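As an illustration, assume a document that contains no `div.middleName` element (a hypothetical selector). With the configuration sketched below, the literal default value would be inserted in place of the missing match.

```xml
<handler
    class="com.norconex.importer.handler.transformer.impl.DOMPreserveTransformer">
  <!-- No match for this selector: "(none)" is inserted, taken literally. -->
  <dom
      selector="div.middleName"
      extract="text"
      defaultValue="(none)"/>
</handler>
```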
You can set matchBlanks to true to match elements that are present but have blank values. Blank values are empty values or values containing only white space. Because the DOM parser normalizes white space, such matches always return an empty string (spaces are trimmed). By default, elements with blank values are not matched and are ignored.
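A short sketch combining matchBlanks with a default value, using a hypothetical `div.middleName` selector. Following the description above, an element that is present but blank should yield an empty string, while a missing element should yield the default value.

```xml
<handler
    class="com.norconex.importer.handler.transformer.impl.DOMPreserveTransformer">
  <!-- Present but blank: empty string. Absent: "n/a". -->
  <dom
      selector="div.middleName"
      extract="text"
      matchBlanks="true"
      defaultValue="n/a"/>
</handler>
```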
You can specify which parser to use when reading documents. The default is "html", which normalizes the content as HTML. This is generally the desired behavior, but it can sometimes cause your selector to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt to normalize the content. The drawback with "xml" is that some HTML-specific selector options may not work. If you know you are dealing with XML to begin with, specifying "xml" should be a good option.
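A minimal sketch of switching parsers for content known to be XML. The `item` selector is a hypothetical element name chosen for illustration.

```xml
<handler
    class="com.norconex.importer.handler.transformer.impl.DOMPreserveTransformer"
    parser="xml">
  <!-- With the "xml" parser, the content is not normalized as HTML. -->
  <dom selector="item" extract="outerHtml"/>
</handler>
```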
It is possible to preserve multiple elements or text by specifying multiple DOM selectors. Each potential match is always performed on the DOM as it was received, not on the result of a previous selector.
You can combine this class with DOMDeleteTransformer for additional flexibility.
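One way the two transformers might be chained is sketched below. It assumes that DOMDeleteTransformer lives in the same package, accepts "dom" tags with a "selector" attribute, and that handlers run in the order they are declared; the selectors are hypothetical. Check these assumptions against the DOMDeleteTransformer documentation before relying on them.

```xml
<!-- First keep only the article container. -->
<handler
    class="com.norconex.importer.handler.transformer.impl.DOMPreserveTransformer">
  <dom selector="div.article" extract="outerHtml"/>
</handler>
<!-- Then remove unwanted parts from what was preserved (assumed syntax). -->
<handler
    class="com.norconex.importer.handler.transformer.impl.DOMDeleteTransformer">
  <dom selector="div.advertisement"/>
</handler>
```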
It is important to note that preserved elements and text may not always form valid XML when put back together. If your goal is to have the Importer parser extract the raw text from the result as it would for any other document, this is not an issue. It could be an issue if you want to use the new document content as XML in a different context.
<handler
    class="com.norconex.importer.handler.transformer.impl.DOMPreserveTransformer"
    parser="[html|xml]"
    sourceCharset="(character encoding)">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  <!-- multiple "dom" tags allowed -->
  <dom
      selector="(selector syntax)"
      extract="[text|html|outerHtml|ownText|data|tagName|val|className|cssSelector|attr(attributeKey)]"
      matchBlanks="[false|true]"
      defaultValue="(optional value to use when no match)"/>
</handler>
<handler
    class="DOMPreserveTransformer">
  <dom
      selector="div.firstName"
      extract="outerHtml"/>
  <dom
      selector="div.lastName"
      extract="outerHtml"/>
</handler>
Given this HTML snippet...
<div> <div class="firstName">Joe</div> <div class="lastName">Dalton</div> <div class="city">Daisy Town</div> </div>
... the above example will result in the document content having the following:
<div class="firstName">Joe</div> <div class="lastName">Dalton</div>
See Also: DOMTagger, DOMDeleteTransformer
Modifier and Type | Class and Description |
---|---|
static class | DOMPreserveTransformer.DOMExtractDetails DOM Extraction Details |
Constructor and Description |
---|
DOMPreserveTransformer() Constructor. |
Modifier and Type | Method and Description |
---|---|
void | addDOMExtractDetails(DOMPreserveTransformer.DOMExtractDetails extractDetails) Adds DOM extraction details. |
boolean | equals(Object other) |
List<DOMPreserveTransformer.DOMExtractDetails> | getDOMExtractDetailsList() Gets a list of DOM extraction details. |
String | getParser() Gets the parser to use when creating the DOM-tree. |
String | getSourceCharset() Gets the assumed source character encoding. |
int | hashCode() |
protected void | loadHandlerFromXML(XML xml) Loads configuration settings specific to the implementing class. |
void | removeDOMExtractDetails(String selector) Removes the DOM extraction details matching the given selector. |
void | removeDOMExtractDetailsList() Removes all DOM extraction details. |
protected void | saveHandlerToXML(XML xml) Saves configuration settings specific to the implementing class. |
void | setParser(String parser) Sets the parser to use when creating the DOM-tree. |
void | setSourceCharset(String sourceCharset) Sets the assumed source character encoding. |
String | toString() |
protected void | transformApplicableDocument(HandlerDoc doc, InputStream document, OutputStream output, ParseState parseState) |
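Methods inherited from class AbstractDocumentTransformer: transformDocument
Methods inherited from class AbstractImporterHandler: addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML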
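public String getSourceCharset()
Gets the assumed source character encoding.

public void setSourceCharset(String sourceCharset)
Sets the assumed source character encoding.
Parameters: sourceCharset - character encoding of the source to be transformed

public String getParser()
Gets the parser to use when creating the DOM-tree.
Returns: html (default) or xml.

public void setParser(String parser)
Sets the parser to use when creating the DOM-tree.
Parameters: parser - html or xml.

protected void transformApplicableDocument(HandlerDoc doc, InputStream document, OutputStream output, ParseState parseState) throws ImporterHandlerException
Specified by: transformApplicableDocument in class AbstractDocumentTransformer
Throws: ImporterHandlerException

public void addDOMExtractDetails(DOMPreserveTransformer.DOMExtractDetails extractDetails)
Adds DOM extraction details.
Parameters: extractDetails - DOM extraction details

public List<DOMPreserveTransformer.DOMExtractDetails> getDOMExtractDetailsList()
Gets a list of DOM extraction details.

public void removeDOMExtractDetails(String selector)
Removes the DOM extraction details matching the given selector.
Parameters: selector - DOM selector

public void removeDOMExtractDetailsList()
Removes all DOM extraction details.

protected void loadHandlerFromXML(XML xml)
Description copied from class: AbstractImporterHandler
Loads configuration settings specific to the implementing class.
Specified by: loadHandlerFromXML in class AbstractImporterHandler
Parameters: xml - XML configuration

protected void saveHandlerToXML(XML xml)
Description copied from class: AbstractImporterHandler
Saves configuration settings specific to the implementing class.
Specified by: saveHandlerToXML in class AbstractImporterHandler
Parameters: xml - the XML

public boolean equals(Object other)
Overrides: equals in class AbstractImporterHandler

public int hashCode()
Overrides: hashCode in class AbstractImporterHandler

public String toString()
Overrides: toString in class AbstractImporterHandler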
Copyright © 2009–2023 Norconex Inc. All rights reserved.