public class DOMCondition extends AbstractCharStreamCondition
A condition using a Document Object Model (DOM) representation of HTML, XHTML, or XML document content to match an element, attribute, or value.
In order to construct a DOM tree, text is loaded entirely into memory. It uses the document content to create the DOM by default, but it can also use metadata fields. If more than one metadata field value is identified as the source of DOM content, only one needs to match for this condition to be true.
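The "only one needs to match" rule for multiple field values can be pictured with a short sketch. This is not the class's actual implementation: `anyValueMatches` and the sample predicate are hypothetical names introduced only for illustration, and the predicate is a stand-in for "parse the value as a DOM and apply the selector".

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class AnyFieldValueMatch {

    // Hypothetical helper: true as soon as one candidate DOM source
    // satisfies the test, mirroring the "only one needs to match" rule.
    static boolean anyValueMatches(List<String> fieldValues, Predicate<String> domTest) {
        return fieldValues.stream().anyMatch(domTest);
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "<p>nothing here</p>",
                "<p class=\"disclaimer\">skip me</p>");
        // Stand-in test; a real condition would build a DOM and run a selector.
        Predicate<String> hasDisclaimer = v -> v.contains("class=\"disclaimer\"");
        System.out.println(anyValueMatches(values, hasDisclaimer)); // prints true
    }
}
```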
Use this condition with caution if you know you'll need to parse huge files. You can use TextFilter instead if this is a concern.
The jsoup parser library is used to load the content into a DOM tree. Elements are referenced using a CSS or jQuery-like syntax.
The use of a value matcher is optional. Without one, any element found by the provided DOM selector constitutes a match. If both a DOM selector and a value matcher are provided, the value of each element matched by the selector is retrieved and the value matcher is applied against it to determine a match.
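The selector-then-value-matcher logic can be sketched with the JDK alone. Note the substitutions: this sketch uses the JDK's XML parser and an XPath expression in place of jsoup and its CSS selectors, the element text content stands in for the default "text" extraction, and the regular expression is applied as a partial match. The class and method names are hypothetical, and it assumes that a single matching value is enough.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SelectorThenValueMatch {

    // valuePattern may be null, meaning "no value matcher": any element
    // found by the selector is then a match.
    static boolean matches(String xml, String xpath, Pattern valuePattern) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            if (nodes.getLength() == 0) {
                return false; // selector found nothing
            }
            if (valuePattern == null) {
                return true; // selector alone decides
            }
            for (int i = 0; i < nodes.getLength(); i++) {
                // Assumption: partial (find) matching on the element text.
                if (valuePattern.matcher(nodes.item(i).getTextContent()).find()) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<p class=\"disclaimer\">Please skip me entirely.</p>"
                + "<p>Regular text.</p>"
                + "</body></html>";
        String selector = "//p[@class='disclaimer']";
        System.out.println(matches(page, selector, null));
        System.out.println(matches(page, selector, Pattern.compile("\\bskip me\\b")));
        System.out.println(matches(page, selector, Pattern.compile("\\bkeep me\\b")));
    }
}
```

The first call has no value matcher, so the presence of the element is enough; the other two calls show the value matcher deciding the outcome.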
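It is possible to control exactly what gets extracted for matching purposes with the "extract" argument of the setExtract(String) method. Possible values are: text (the default), html, outerHtml, ownText, data, tagName, val, className, cssSelector, and attr(attributeKey).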
Should be used as a pre-parse handler.
If you are dealing with multiple document types and you are using this condition on the document content, it is important to restrict this condition to text-based XML-like content only to prevent DOM-parsing errors.
By default this condition only applies to documents matching the content types listed in CommonMatchers.domContentTypes(). Other content types always make this condition false.
You can overwrite these default content types by providing your own content type matcher. Make sure the content types you use represent a file with HTML or XML-like markup tags.
When used as a pre-parse handler, this condition uses the detected or previously set content character encoding unless the character encoding was specified using AbstractCharStreamCondition.setSourceCharset(String). Since document parsing converts content to UTF-8, UTF-8 is always assumed when used as a post-parse handler.
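The character encoding rule can be summarized as a small decision function. This is only a sketch of the rule as described here, not the library's code: `CharsetChoice`, `resolve`, and its parameters are hypothetical, and the final fallback to UTF-8 when nothing is known is an assumption of this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetChoice {

    // Hypothetical helper illustrating the rule described above.
    static Charset resolve(String sourceCharset, String detectedCharset, boolean postParse) {
        if (postParse) {
            return StandardCharsets.UTF_8; // parsed content is assumed to be UTF-8
        }
        if (sourceCharset != null && !sourceCharset.isEmpty()) {
            return Charset.forName(sourceCharset); // explicitly specified encoding wins
        }
        if (detectedCharset != null && !detectedCharset.isEmpty()) {
            return Charset.forName(detectedCharset); // otherwise use the detected one
        }
        // Assumption for this sketch only: default to UTF-8 when nothing is known.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(resolve("ISO-8859-1", "UTF-16", false)); // pre-parse, explicit
        System.out.println(resolve(null, "UTF-16", false));         // pre-parse, detected
        System.out.println(resolve("ISO-8859-1", "UTF-16", true));  // post-parse
    }
}
```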
You can specify which DOM parser to use when reading documents. The default is "html", which tries to normalize/fix the content as HTML. This is generally the desired behavior, but it can sometimes cause your selector to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt normalization on the content. The drawback with "xml" is that not all HTML-specific selector options may work. If you know you are dealing with XML to begin with, specifying "xml" is a good option.
<handler
    class="com.norconex.importer.handler.condition.impl.DOMCondition"
    sourceCharset="(character encoding)"
    selector="(selector syntax)"
    parser="[html|xml]"
    extract="[text|html|outerHtml|ownText|data|tagName|val|className|cssSelector|attr(attributeKey)]">
  <fieldMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (Optional expression matching one or more fields where the DOM text is
    located.)
  </fieldMatcher>
  <valueMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (Optional expression matching selector extracted value.)
  </valueMatcher>
  <contentTypeMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (Optional expression overwriting the content types this condition applies
    to.)
  </contentTypeMatcher>
</handler>
<!-- Matches an HTML page that has one or more GIF images in it: -->
<condition
    class="DOMCondition"
    selector="img[src$=.gif]"
    onMatch="exclude"/>
<!--
  Matches an HTML page that has a paragraph tag with a class called
  "disclaimer" and a value containing "skip me":
  -->
<condition
    class="DOMCondition"
    selector="p.disclaimer"
    onMatch="exclude">
  <valueMatcher
      method="regex">
    \bskip me\b
  </valueMatcher>
</condition>
Constructor and Description |
---|
DOMCondition() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
TextMatcher | getContentTypeMatcher() Gets this condition content-type matcher. |
String | getExtract() Gets what should be extracted for the value. |
TextMatcher | getFieldMatcher() Gets this condition field matcher. |
String | getParser() Gets the parser to use when creating the DOM-tree. |
String | getSelector() |
TextMatcher | getValueMatcher() Gets this condition value matcher. |
int | hashCode() |
protected void | loadCharStreamConditionFromXML(XML xml) Loads configuration settings specific to the implementing class. |
protected void | saveCharStreamConditionToXML(XML xml) Saves configuration settings specific to the implementing class. |
void | setContentTypeMatcher(TextMatcher contentTypeMatcher) Sets this condition content-type matcher. |
void | setExtract(String extract) Sets what should be extracted for the value. |
void | setFieldMatcher(TextMatcher fieldMatcher) Sets this condition field matcher. |
void | setParser(String parser) Sets the parser to use when creating the DOM-tree. |
void | setSelector(String selector) |
void | setValueMatcher(TextMatcher valueMatcher) Sets this condition value matcher. |
protected boolean | testDocument(HandlerDoc doc, Reader input, ParseState parseState) |
String | toString() |
Methods inherited from class AbstractCharStreamCondition: getSourceCharset, loadFromXML, saveToXML, setSourceCharset, testDocument
public TextMatcher getFieldMatcher()
public void setFieldMatcher(TextMatcher fieldMatcher)
Parameters: fieldMatcher - field matcher
public TextMatcher getValueMatcher()
public void setValueMatcher(TextMatcher valueMatcher)
Parameters: valueMatcher - value matcher
public TextMatcher getContentTypeMatcher()
public void setContentTypeMatcher(TextMatcher contentTypeMatcher)
Parameters: contentTypeMatcher - content-type matcher
public String getExtract()
null means this class will use the default ("text").
public void setExtract(String extract)
null means this class will use the default ("text").
Parameters: extract - what should be extracted for the value
public String getParser()
Returns: html (default) or xml.
public void setParser(String parser)
Parameters: parser - html or xml.
public String getSelector()
public void setSelector(String selector)
protected boolean testDocument(HandlerDoc doc, Reader input, ParseState parseState) throws ImporterHandlerException
testDocument in class AbstractCharStreamCondition
Throws: ImporterHandlerException
protected void loadCharStreamConditionFromXML(XML xml)
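Description copied from class: AbstractCharStreamCondition
Loads configuration settings specific to the implementing class.
loadCharStreamConditionFromXML in class AbstractCharStreamCondition
Parameters: xml - XML configuration
protected void saveCharStreamConditionToXML(XML xml)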
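Description copied from class: AbstractCharStreamCondition
Saves configuration settings specific to the implementing class.
saveCharStreamConditionToXML in class AbstractCharStreamCondition
Parameters: xml - the XML
public boolean equals(Object other)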
equals in class AbstractCharStreamCondition
public int hashCode()
hashCode in class AbstractCharStreamCondition
public String toString()
toString in class AbstractCharStreamCondition
Copyright © 2009–2023 Norconex Inc. All rights reserved.