Class DOMContentFilter
- java.lang.Object
  - com.norconex.importer.handler.AbstractImporterHandler
    - com.norconex.importer.handler.filter.AbstractDocumentFilter
      - com.norconex.importer.handler.filter.impl.DOMContentFilter
- All Implemented Interfaces: IXMLConfigurable, IDocumentFilter, IOnMatchFilter, IImporterHandler
@Deprecated public class DOMContentFilter extends AbstractDocumentFilter
Deprecated. Since 3.0.0, use DOMFilter.
Uses a Document Object Model (DOM) representation of an HTML, XHTML, or XML document's content to perform filtering based on matching an element/attribute or element/attribute value.
In order to construct a DOM tree, a document's content is loaded entirely into memory. Use this filter with caution if you know you'll need to parse huge files. You can use RegexContentFilter instead if this is a concern.
The jsoup parser library is used to load a document's content into a DOM tree. Elements are referenced using a CSS or JQuery-like syntax.
If an element is referenced without a value to match, its mere presence constitutes a match. If both an element and a regular expression are provided, the element value is retrieved and the regular expression is applied against it for a match.
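The matching rule above can be sketched with plain Java. This is only an illustration under stated assumptions, not the actual DOMContentFilter code: the class ElementMatchSketch and its isMatch method are hypothetical, the list of strings stands in for the values extracted from the selected elements, and a partial match (find) is assumed to be sufficient.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the matching rule described above; this is
// NOT the actual DOMContentFilter implementation.
final class ElementMatchSketch {

    // values: one extracted value per element selected in the DOM.
    // regex: optional expression; null means "presence is enough".
    static boolean isMatch(List<String> values, Pattern regex) {
        if (regex == null) {
            // No expression: the mere presence of an element is a match.
            return !values.isEmpty();
        }
        for (String value : values) {
            // Assumption: a partial match (find) is sufficient.
            if (value != null && regex.matcher(value).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Pattern skipMe = Pattern.compile("\\bskip me\\b");
        System.out.println(isMatch(List.of("Please skip me now"), skipMe));
        System.out.println(isMatch(List.of("Keep this page"), skipMe));
        System.out.println(isMatch(List.of("Keep this page"), null));
        System.out.println(isMatch(List.of(), null));
    }
}
```

Whether a match then includes or excludes the document is decided separately by the onMatch setting.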
Refer to AbstractDocumentFilter for the inclusion/exclusion logic.
Should be used as a pre-parse handler.
Content-types
By default, this filter is restricted to (applies only to) documents matching the restrictions returned by CommonRestrictions.domContentTypes(String). You can specify your own content types if you know they represent a file with HTML or XML-like markup tags. For documents that are incompatible, consider using RegexContentFilter instead.
When used as a pre-parse handler, this class attempts to detect the content character encoding unless the character encoding was specified using setSourceCharset(String). Since document parsing converts content to UTF-8, UTF-8 is always assumed when used as a post-parse handler.
It is possible to control exactly what gets extracted for matching purposes with the "extract" argument of the setExtract(String) method. Possible values are:
- text: Default option when extract is blank. The text of the element, including combined children.
- html: Extracts an element's inner HTML (including children).
- outerHtml: Extracts an element's outer HTML (like "html", but includes the "current" tag).
- ownText: Extracts the text owned by this element only; does not get the combined text of all children.
- data: Extracts the combined data of a data-element (e.g. <script>).
- id: Extracts the ID attribute of the element (if any).
- tagName: Extracts the name of the tag of the element.
- val: Extracts the value of a form element (input, textarea, etc).
- className: Extracts the literal value of the element's "class" attribute, which may include multiple class names, space separated.
- cssSelector: Extracts a CSS selector that will uniquely select (identify) this element.
- attr(attributeKey): Extracts the value of the element attribute matching your replacement for "attributeKey" (e.g. "attr(title)" will extract the "title" attribute).
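As a rough illustration of how an "extract" value from the list above might be interpreted, the sketch below resolves a blank value to the "text" default and pulls the attribute key out of the "attr(attributeKey)" form. The class ExtractOptionSketch and its methods are hypothetical helpers written for this example; they are not part of the Norconex API, and the real parsing may differ.

```java
// Hypothetical helper for interpreting "extract" values; this is NOT
// part of the Norconex Importer API.
final class ExtractOptionSketch {

    static final String DEFAULT_EXTRACT = "text";

    // Returns the attribute key of an "attr(attributeKey)" value,
    // or null when the value does not use that form.
    static String attributeKey(String extract) {
        if (extract == null) {
            return null;
        }
        String value = extract.trim();
        if (value.startsWith("attr(") && value.endsWith(")")) {
            String key = value.substring(5, value.length() - 1).trim();
            return key.isEmpty() ? null : key;
        }
        return null;
    }

    // Returns "text" for a blank value, "attr" for the attr(...) form,
    // and the trimmed value itself otherwise (e.g. "outerHtml").
    static String mode(String extract) {
        if (extract == null || extract.trim().isEmpty()) {
            return DEFAULT_EXTRACT;
        }
        return attributeKey(extract) != null ? "attr" : extract.trim();
    }

    public static void main(String[] args) {
        System.out.println(mode(null));
        System.out.println(mode("outerHtml"));
        System.out.println(mode("attr(title)"));
        System.out.println(attributeKey("attr(title)"));
    }
}
```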
You can specify which parser to use when reading documents. The default is "html", which normalizes the content as HTML. This is generally the desired behavior, but it can sometimes cause your selector to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt normalization on the content. The drawback with "xml" is that not all HTML-specific selector options may work. If you know you are dealing with XML to begin with, specifying "xml" should be a good option.
XML configuration usage:
<handler class="com.norconex.importer.handler.filter.impl.DOMContentFilter"
    onMatch="[include|exclude]"
    sourceCharset="(character encoding)"
    selector="(selector syntax)"
    parser="[html|xml]"
    extract="[text|html|outerHtml|ownText|data|tagName|val|className|cssSelector|attr(attributeKey)]">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher>(field-matching expression)</fieldMatcher>
    <valueMatcher>(value-matching expression)</valueMatcher>
  </restrictTo>
  <valueMatcher>
    (optional expression matching selector extracted value)
  </valueMatcher>
</handler>
XML usage example:
<!-- Exclude an HTML page that has one or more GIF images in it: -->
<handler class="DOMContentFilter" selector="img[src$=.gif]" onMatch="exclude"/>

<!-- Exclude an HTML page that has a paragraph tag with a class called
     "disclaimer" and a value containing "skip me": -->
<handler class="DOMContentFilter" selector="p.disclaimer" onMatch="exclude">
  <regex>\bskip me\b</regex>
</handler>
- Since: 2.4.0
- Author: Pascal Essiembre
-
Constructor Summary
Constructors:
- DOMContentFilter(): Deprecated.
- DOMContentFilter(String regex): Deprecated. Since 3.0.0
- DOMContentFilter(String regex, OnMatch onMatch): Deprecated. Since 3.0.0
- DOMContentFilter(String regex, OnMatch onMatch, boolean caseSensitive): Deprecated. Since 3.0.0
-
Method Summary
Methods (modifier and type, method, description):
- boolean equals(Object other): Deprecated.
- String getExtract(): Deprecated. Gets what should be extracted for the value.
- String getParser(): Deprecated. Gets the parser to use when creating the DOM-tree.
- String getRegex(): Deprecated. Since 3.0.0, use getValueMatcher()
- String getSelector(): Deprecated.
- String getSourceCharset(): Deprecated. Gets the assumed source character encoding.
- TextMatcher getValueMatcher(): Deprecated. Gets this filter text matcher (copy).
- int hashCode(): Deprecated.
- boolean isCaseSensitive(): Deprecated. Since 3.0.0, use getValueMatcher()
- protected boolean isDocumentMatched(HandlerDoc doc, InputStream input, ParseState parseState): Deprecated.
- protected void loadFilterFromXML(XML xml): Deprecated.
- protected void saveFilterToXML(XML xml): Deprecated.
- void setCaseSensitive(boolean caseSensitive): Deprecated. Since 3.0.0, use getValueMatcher()
- void setExtract(String extract): Deprecated. Sets what should be extracted for the value.
- void setParser(String parser): Deprecated. Sets the parser to use when creating the DOM-tree.
- void setRegex(String regex): Deprecated. Since 3.0.0, use getValueMatcher()
- void setSelector(String selector): Deprecated.
- void setSourceCharset(String sourceCharset): Deprecated. Sets the assumed source character encoding.
- void setValueMatcher(TextMatcher valueMatcher): Deprecated. Sets this filter text matcher (copy).
- String toString(): Deprecated.
-
Methods inherited from class com.norconex.importer.handler.filter.AbstractDocumentFilter
acceptDocument, getOnMatch, loadHandlerFromXML, saveHandlerToXML, setOnMatch
-
Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
-
Constructor Detail
-
DOMContentFilter
public DOMContentFilter()
Deprecated.
-
DOMContentFilter
@Deprecated public DOMContentFilter(String regex)
Deprecated. Since 3.0.0
Constructor.
- Parameters:
  regex - regular expression
-
DOMContentFilter
@Deprecated public DOMContentFilter(String regex, OnMatch onMatch)
Deprecated. Since 3.0.0
Constructor.
- Parameters:
  regex - regular expression
  onMatch - on match instruction
-
DOMContentFilter
@Deprecated public DOMContentFilter(String regex, OnMatch onMatch, boolean caseSensitive)
Deprecated. Since 3.0.0
Constructor.
- Parameters:
  regex - regular expression
  onMatch - on match instruction
  caseSensitive - whether regular expression is case sensitive
-
Method Detail
-
getRegex
@Deprecated public String getRegex()
Deprecated. Since 3.0.0, use getValueMatcher()
Gets the expression matching text extracted by selector.
- Returns: expression
-
setRegex
@Deprecated public final void setRegex(String regex)
Deprecated. Since 3.0.0, use getValueMatcher()
Sets the expression matching text extracted by selector.
- Parameters:
  regex - expression
-
isCaseSensitive
@Deprecated public boolean isCaseSensitive()
Deprecated. Since 3.0.0, use getValueMatcher()
Gets whether expression matching text extracted by selector is case sensitive.
- Returns: true if case sensitive
-
setCaseSensitive
@Deprecated public void setCaseSensitive(boolean caseSensitive)
Deprecated. Since 3.0.0, use getValueMatcher()
Sets whether expression matching text extracted by selector is case sensitive.
- Parameters:
  caseSensitive - true if case sensitive
-
getSelector
public String getSelector()
Deprecated.
-
setSelector
public void setSelector(String selector)
Deprecated.
-
getValueMatcher
public TextMatcher getValueMatcher()
Deprecated. Gets this filter text matcher (copy).
- Returns: text matcher
- Since: 3.0.0
-
setValueMatcher
public void setValueMatcher(TextMatcher valueMatcher)
Deprecated. Sets this filter text matcher (copy).
- Parameters:
  valueMatcher - text matcher
- Since: 3.0.0
-
getExtract
public String getExtract()
Deprecated. Gets what should be extracted for the value. Typical values are "text" (default), "html", and "outerHtml"; see the class description for the full list. null means this class will use the default ("text").
- Returns: what should be extracted for the value
- Since: 2.5.0
-
setExtract
public void setExtract(String extract)
Deprecated. Sets what should be extracted for the value. Typical values are "text" (default), "html", and "outerHtml"; see the class description for the full list. null means this class will use the default ("text").
- Parameters:
  extract - what should be extracted for the value
- Since: 2.5.0
-
getSourceCharset
public String getSourceCharset()
Deprecated. Gets the assumed source character encoding.
- Returns: character encoding of the source to be transformed
- Since: 2.5.0
-
setSourceCharset
public void setSourceCharset(String sourceCharset)
Deprecated. Sets the assumed source character encoding.
- Parameters:
  sourceCharset - character encoding of the source to be transformed
- Since: 2.5.0
-
getParser
public String getParser()
Deprecated. Gets the parser to use when creating the DOM-tree.
- Returns: html (default) or xml.
- Since: 2.8.0
-
setParser
public void setParser(String parser)
Deprecated. Sets the parser to use when creating the DOM-tree.
- Parameters:
  parser - html or xml.
- Since: 2.8.0
-
isDocumentMatched
protected boolean isDocumentMatched(HandlerDoc doc, InputStream input, ParseState parseState) throws ImporterHandlerException
Deprecated.
- Specified by: isDocumentMatched in class AbstractDocumentFilter
- Throws: ImporterHandlerException
-
loadFilterFromXML
protected void loadFilterFromXML(XML xml)
Deprecated.
- Specified by: loadFilterFromXML in class AbstractDocumentFilter
-
saveFilterToXML
protected void saveFilterToXML(XML xml)
Deprecated.
- Specified by: saveFilterToXML in class AbstractDocumentFilter
-
equals
public boolean equals(Object other)
Deprecated.
- Overrides: equals in class AbstractDocumentFilter
-
hashCode
public int hashCode()
Deprecated.
- Overrides: hashCode in class AbstractDocumentFilter
-
toString
public String toString()
Deprecated.
- Overrides: toString in class AbstractDocumentFilter