Deprecated. Use DOMFilter instead.
@Deprecated public class DOMContentFilter extends AbstractDocumentFilter
Uses a Document Object Model (DOM) representation of HTML, XHTML, or XML document content to filter documents based on matching an element/attribute or an element/attribute value.
To construct the DOM tree, the document content is loaded entirely into memory. Use this filter with caution if you expect to parse huge files. If memory is a concern, consider using RegexContentFilter instead.
The jsoup parser library is used to load the document content into a DOM tree. Elements are referenced using a CSS or jQuery-like selector syntax.
If an element is referenced without a value to match, its mere presence constitutes a match. If both an element and a regular expression are provided, the element value is retrieved and the regular expression is applied against it to determine a match.
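The two matching modes described above can be sketched with the JDK alone. This is only an illustration, not the filter's implementation: it swaps in the JDK XML parser and XPath for jsoup and CSS selectors, so it only accepts well-formed XML/XHTML, and the class and method names (DomMatchSketch, matches) are hypothetical. It also assumes the regular expression may match anywhere in the element text.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; XPath stands in for the jsoup selector syntax.
final class DomMatchSketch {

    /**
     * With a null regex, the presence of any selected node is a match.
     * Otherwise the text of each selected node is tested against the regex
     * (assumption: a partial match anywhere in the text is enough).
     */
    static boolean matches(String xml, String xpath, String regex) {
        try {
            // The whole content is parsed into an in-memory DOM tree.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            if (regex == null) {
                return nodes.getLength() > 0;
            }
            Pattern pattern = Pattern.compile(regex);
            for (int i = 0; i < nodes.getLength(); i++) {
                if (pattern.matcher(nodes.item(i).getTextContent()).find()) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate: " + xpath, e);
        }
    }
}
```

For example, calling matches(content, "//img", null) tests for mere presence of an image element, while matches(content, "//p[@class='disclaimer']", "\\bskip me\\b") applies the regular expression to the paragraph text.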
Refer to AbstractDocumentFilter for the inclusion/exclusion logic.
Should be used as a pre-parse handler.
By default, this filter is restricted to (applies only to) documents matching the restrictions returned by CommonRestrictions.domContentTypes(String).
You can specify your own content types if you know they represent files with HTML or XML-like markup tags. For documents that are incompatible, consider using RegexContentFilter instead.
When used as a pre-parse handler, this class attempts to detect the content character encoding unless the character encoding was specified using setSourceCharset(String). Since document parsing converts content to UTF-8, UTF-8 is always assumed when used as a post-parse handler.
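The character encoding rules above can be summarized in a few lines. This is a minimal sketch under stated assumptions: the class and method names (CharsetSketch, resolve) are hypothetical, and the "detected" argument stands in for whatever encoding detection the importer performs, which is not shown here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper mirroring the documented precedence of encodings.
final class CharsetSketch {

    /**
     * @param sourceCharset explicitly configured encoding, or null/blank
     * @param detected      encoding obtained by detection (assumed given)
     * @param preParse      true when used as a pre-parse handler
     */
    static Charset resolve(String sourceCharset, String detected, boolean preParse) {
        if (!preParse) {
            // Parsing has already converted the content to UTF-8.
            return StandardCharsets.UTF_8;
        }
        if (sourceCharset != null && !sourceCharset.trim().isEmpty()) {
            // An explicit setting takes precedence over detection.
            return Charset.forName(sourceCharset.trim());
        }
        return Charset.forName(detected);
    }
}
```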
You can control exactly what gets extracted for matching purposes through the "extract" argument of setExtract(String). Possible values are text (the default), html, outerHtml, ownText, data, tagName, val, className, cssSelector, and attr(attributeKey).
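To make the "extract" options more concrete, here is a rough sketch of how a few of them could map onto a DOM element. It is an approximation built on the JDK DOM API rather than jsoup, it covers only text, ownText, tagName, className, and attr(attributeKey), and the names ExtractSketch, parseRoot, and extract are hypothetical. Whitespace handling in the real filter may differ.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper covering a subset of the "extract" values.
final class ExtractSketch {

    private static final Pattern ATTR = Pattern.compile("attr\\((.+)\\)");

    /** Parses well-formed XML and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    /** Returns the value to match; a null or blank option means "text". */
    static String extract(Element el, String extract) {
        String mode = (extract == null || extract.trim().isEmpty())
                ? "text" : extract.trim();
        Matcher attr = ATTR.matcher(mode);
        if (attr.matches()) {
            return el.getAttribute(attr.group(1)); // attr(attributeKey)
        }
        switch (mode) {
            case "text":      return el.getTextContent(); // element and descendants
            case "ownText":   return ownText(el);         // direct text nodes only
            case "tagName":   return el.getTagName();
            case "className": return el.getAttribute("class");
            default:
                throw new IllegalArgumentException(
                        "Not covered by this sketch: " + mode);
        }
    }

    private static String ownText(Element el) {
        StringBuilder sb = new StringBuilder();
        NodeList children = el.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString();
    }
}
```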
You can specify which parser to use when reading documents. The default is "html", which normalizes the content as HTML. This is generally the desired behavior, but it can sometimes cause your selector to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt to normalize the content. The drawback of "xml" is that some HTML-specific selector options may not work. If you know you are dealing with XML to begin with, specifying "xml" should be a good option.
<handler
    class="com.norconex.importer.handler.filter.impl.DOMContentFilter"
    onMatch="[include|exclude]"
    sourceCharset="(character encoding)"
    selector="(selector syntax)"
    parser="[html|xml]"
    extract="[text|html|outerHtml|ownText|data|tagName|val|className|cssSelector|attr(attributeKey)]">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  <valueMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (optional expression matching selector extracted value)
  </valueMatcher>
</handler>
<!-- Exclude an HTML page that has one or more GIF images in it: -->
<handler
    class="DOMContentFilter"
    selector="img[src$=.gif]"
    onMatch="exclude"/>
<!--
  Exclude an HTML page that has a paragraph tag with a class called
  "disclaimer" and a value containing "skip me":
-->
<handler
    class="DOMContentFilter"
    selector="p.disclaimer"
    onMatch="exclude">
  <regex>\bskip me\b</regex>
</handler>
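The first configuration example can be read as the following decision, sketched here with the JDK DOM API instead of jsoup. The names GifExcludeSketch, hasGifImage, and accept are hypothetical, and the sketch assumes that with onMatch="exclude" a single filter rejects a document exactly when its selector matches. The input must be well-formed XHTML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper imitating selector "img[src$=.gif]" with onMatch="exclude".
final class GifExcludeSketch {

    /** True if at least one img element has a src attribute ending in ".gif". */
    static boolean hasGifImage(String xhtml) {
        try {
            NodeList images = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getElementsByTagName("img");
            for (int i = 0; i < images.getLength(); i++) {
                Element img = (Element) images.item(i);
                if (img.getAttribute("src").endsWith(".gif")) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XHTML", e);
        }
    }

    /** Assumed exclude semantics: a matched document is rejected. */
    static boolean accept(String xhtml) {
        return !hasGifImage(xhtml);
    }
}
```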
Constructor and Description |
---|
DOMContentFilter() Deprecated. |
DOMContentFilter(String regex) Deprecated. Since 3.0.0 |
DOMContentFilter(String regex, OnMatch onMatch) Deprecated. Since 3.0.0 |
DOMContentFilter(String regex, OnMatch onMatch, boolean caseSensitive) Deprecated. Since 3.0.0 |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) Deprecated. |
String | getExtract() Deprecated. Gets what should be extracted for the value. |
String | getParser() Deprecated. Gets the parser to use when creating the DOM-tree. |
String | getRegex() Deprecated. Since 3.0.0, use getValueMatcher() |
String | getSelector() Deprecated. |
String | getSourceCharset() Deprecated. Gets the assumed source character encoding. |
TextMatcher | getValueMatcher() Deprecated. Gets this filter text matcher (copy). |
int | hashCode() Deprecated. |
boolean | isCaseSensitive() Deprecated. Since 3.0.0, use getValueMatcher() |
protected boolean | isDocumentMatched(HandlerDoc doc, InputStream input, ParseState parseState) Deprecated. |
protected void | loadFilterFromXML(XML xml) Deprecated. |
protected void | saveFilterToXML(XML xml) Deprecated. |
void | setCaseSensitive(boolean caseSensitive) Deprecated. Since 3.0.0, use getValueMatcher() |
void | setExtract(String extract) Deprecated. Sets what should be extracted for the value. |
void | setParser(String parser) Deprecated. Sets the parser to use when creating the DOM-tree. |
void | setRegex(String regex) Deprecated. Since 3.0.0, use getValueMatcher() |
void | setSelector(String selector) Deprecated. |
void | setSourceCharset(String sourceCharset) Deprecated. Sets the assumed source character encoding. |
void | setValueMatcher(TextMatcher valueMatcher) Deprecated. Sets this filter text matcher (copy). |
String | toString() Deprecated. |
Inherited methods (from AbstractDocumentFilter and its superclasses):
acceptDocument, getOnMatch, loadHandlerFromXML, saveHandlerToXML, setOnMatch
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
public DOMContentFilter()

@Deprecated public DOMContentFilter(String regex)
regex - regular expression

@Deprecated public DOMContentFilter(String regex, OnMatch onMatch)
regex - regular expression
onMatch - on match instruction

@Deprecated public DOMContentFilter(String regex, OnMatch onMatch, boolean caseSensitive)
regex - regular expression
onMatch - on match instruction
caseSensitive - whether regular expression is case sensitive

@Deprecated public String getRegex()
Deprecated. Since 3.0.0, use getValueMatcher()

@Deprecated public final void setRegex(String regex)
Deprecated. Since 3.0.0, use getValueMatcher()
regex - expression

@Deprecated public boolean isCaseSensitive()
Deprecated. Since 3.0.0, use getValueMatcher()
true if case sensitive

@Deprecated public void setCaseSensitive(boolean caseSensitive)
Deprecated. Since 3.0.0, use getValueMatcher()
caseSensitive - true if case sensitive

public String getSelector()

public void setSelector(String selector)

public TextMatcher getValueMatcher()

public void setValueMatcher(TextMatcher valueMatcher)
valueMatcher - text matcher

public String getExtract()
null means this class will use the default ("text").

public void setExtract(String extract)
null means this class will use the default ("text").
extract - what should be extracted for the value

public String getSourceCharset()

public void setSourceCharset(String sourceCharset)
sourceCharset - character encoding of the source to be transformed

public String getParser()
html (default) or xml.

public void setParser(String parser)
parser - html or xml.

protected boolean isDocumentMatched(HandlerDoc doc, InputStream input, ParseState parseState) throws ImporterHandlerException
isDocumentMatched in class AbstractDocumentFilter

protected void loadFilterFromXML(XML xml)
loadFilterFromXML in class AbstractDocumentFilter

protected void saveFilterToXML(XML xml)
saveFilterToXML in class AbstractDocumentFilter

public boolean equals(Object other)
equals in class AbstractDocumentFilter

public int hashCode()
hashCode in class AbstractDocumentFilter

public String toString()
toString in class AbstractDocumentFilter

Copyright © 2009–2023 Norconex Inc. All rights reserved.