DOMFilter (Norconex Importer 3.0.1 API)

java.lang.Object
- com.norconex.importer.handler.AbstractImporterHandler
  - com.norconex.importer.handler.filter.AbstractDocumentFilter
    - com.norconex.importer.handler.filter.impl.DOMFilter

All Implemented Interfaces:

IXMLConfigurable, IDocumentFilter, IOnMatchFilter, IImporterHandler
```
public class DOMFilter
extends AbstractDocumentFilter
```
Uses a Document Object Model (DOM) representation of HTML, XHTML, or XML document content to filter documents based on matching an element/attribute or an element/attribute value.

To construct a DOM tree, the text is loaded entirely into memory. The filter uses the document content by default, but the text can also come from specified metadata fields. If multiple field values are identified as DOM sources, only one needs to match for the filter to be applied. Use this filter with caution if you expect to parse huge files. You can use TextFilter instead if memory is a concern.

The jsoup parser library is used to load document content into a DOM tree. Elements are referenced using a CSS or jQuery-like selector syntax.

If an element is referenced without a value to match, its mere presence constitutes a match. If both an element and a regular expression are provided, the element value is retrieved and the regular expression is applied against it to determine a match.
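
The matching rule described above can be sketched in plain Java. This is a simplified model for illustration only, not the library's implementation: the class `DomMatchSketch`, the method `isMatch`, and the way the `partial` flag is interpreted (match anywhere versus match the whole value) are assumptions introduced for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical, simplified model of the presence-versus-value matching rule. */
final class DomMatchSketch {

    /**
     * @param extractedValues values extracted from the elements found by the selector
     * @param valueRegex      regular expression to apply, or null/empty when none is set
     * @param partial         assumed meaning: true accepts a match anywhere in the value,
     *                        false requires the whole value to match
     */
    static boolean isMatch(List<String> extractedValues, String valueRegex, boolean partial) {
        if (valueRegex == null || valueRegex.isEmpty()) {
            // No value to match: the mere presence of an element is a match.
            return !extractedValues.isEmpty();
        }
        Pattern pattern = Pattern.compile(valueRegex);
        for (String value : extractedValues) {
            boolean hit = partial
                    ? pattern.matcher(value).find()
                    : pattern.matcher(value).matches();
            if (hit) {
                return true; // one matching value is enough
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("Please skip me entirely.");
        System.out.println(isMatch(values, null, false));
        System.out.println(isMatch(values, "\\bskip me\\b", true));
        System.out.println(isMatch(Collections.<String>emptyList(), null, false));
    }
}
```

Whether a match then includes or excludes the document is decided separately by the inclusion/exclusion logic of AbstractDocumentFilter.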

Refer to AbstractDocumentFilter for the inclusion/exclusion logic.

Should be used as a pre-parse handler.

Content-types

By default, this filter is restricted to (applies only to) documents matching the restrictions returned by CommonRestrictions.domContentTypes(String). You can specify your own content types if you know they represent a file with HTML or XML-like markup tags. For documents that are incompatible, consider using RegexContentFilter instead.

When used as a pre-parse handler, this class attempts to detect the content character encoding unless the character encoding was specified using setSourceCharset(String). Since document parsing converts content to UTF-8, UTF-8 is always assumed when used as a post-parse handler.
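
The character encoding rule above can be pictured with a small standard-library sketch. It is an assumption-laden illustration rather than the handler's real code: `SourceCharsetSketch`, `resolve`, and the `detector` supplier (a stand-in for whatever detection the handler performs) are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

/** Hypothetical sketch of how the effective source charset could be chosen. */
final class SourceCharsetSketch {

    /**
     * @param sourceCharset explicitly configured charset name, or null/blank
     * @param preParse      true when the handler runs before document parsing
     * @param detector      stand-in for charset detection (assumption)
     */
    static Charset resolve(String sourceCharset, boolean preParse, Supplier<Charset> detector) {
        if (!preParse) {
            // Parsing converts content to UTF-8, so post-parse always assumes UTF-8.
            return StandardCharsets.UTF_8;
        }
        if (sourceCharset != null && !sourceCharset.trim().isEmpty()) {
            // An explicitly configured charset takes precedence over detection.
            return Charset.forName(sourceCharset.trim());
        }
        // Nothing configured: fall back to detection.
        return detector.get();
    }

    public static void main(String[] args) {
        Supplier<Charset> detector = () -> StandardCharsets.UTF_16; // pretend detection result
        System.out.println(resolve("ISO-8859-1", true, detector));
        System.out.println(resolve(null, true, detector));
        System.out.println(resolve("ISO-8859-1", false, detector));
    }
}
```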

You can control exactly what gets extracted for matching purposes with the "extract" argument of the new method setExtract(String). Possible values are:
- text: Default option when extract is blank. The text of the element, including combined children.
- html: Extracts an element's inner HTML (including children).
- outerHtml: Extracts an element's outer HTML (like "html", but includes the "current" tag).
- ownText: Extracts the text owned by this element only; does not get the combined text of all children.
- data: Extracts the combined data of a data-element (e.g. <script>).
- id: Extracts the ID attribute of the element (if any).
- tagName: Extracts the name of the tag of the element.
- val: Extracts the value of a form element (input, textarea, etc).
- className: Extracts the literal value of the element's "class" attribute, which may include multiple class names, space separated.
- cssSelector: Extracts a CSS selector that will uniquely select (identify) this element.
- attr(attributeKey): Extracts the value of the element attribute matching your replacement for "attributeKey" (e.g. "attr(title)" will extract the "title" attribute).
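
To make the syntax of the "extract" values concrete, the following sketch shows one way such an argument could be interpreted: a blank value falls back to "text", and "attr(attributeKey)" yields the attribute key. The class `ExtractOptionSketch` and both methods are hypothetical helpers written for this illustration; they are not part of DOMFilter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper illustrating the syntax of the "extract" argument. */
final class ExtractOptionSketch {

    // Matches the "attr(attributeKey)" form and captures the key.
    private static final Pattern ATTR = Pattern.compile("^attr\\((.*)\\)$");

    /** Returns the documented default, "text", when the option is blank. */
    static String effectiveOption(String extract) {
        if (extract == null || extract.trim().isEmpty()) {
            return "text";
        }
        return extract.trim();
    }

    /** Returns the attribute key of "attr(key)", or null for any other option. */
    static String attributeKey(String extract) {
        if (extract == null) {
            return null;
        }
        Matcher m = ATTR.matcher(extract.trim());
        return m.matches() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(effectiveOption(null));
        System.out.println(effectiveOption("outerHtml"));
        System.out.println(attributeKey("attr(title)"));
    }
}
```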

You can specify which parser to use when reading documents. The default is "html", which normalizes the content as HTML. This is generally the desired behavior, but it can sometimes cause your selector to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt to normalize the content. The drawback of "xml" is that some HTML-specific selector options may not work. If you know you are dealing with XML to begin with, specifying "xml" should be a good option.

XML configuration usage:
```
<handler
    class="com.norconex.importer.handler.filter.impl.DOMFilter"
    onMatch="[include|exclude]"
    sourceCharset="(character encoding)"
    selector="(selector syntax)"
    parser="[html|xml]"
    extract="[text|html|outerHtml|ownText|data|id|tagName|val|className|cssSelector|attr(attributeKey)]">
  
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  <fieldMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (optional expression matching fields where the DOM text is located)
  </fieldMatcher>
  <valueMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (optional expression matching selector extracted value)
  </valueMatcher>
</handler>
```
XML usage example:
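```
<!-- Exclude documents that contain a GIF image. -->
<handler
    class="DOMFilter"
    selector="img[src$=.gif]"
    onMatch="exclude"/>

<!-- Exclude documents whose disclaimer paragraph matches the regular expression. -->
<handler
    class="DOMFilter"
    selector="p.disclaimer"
    onMatch="exclude">
  <valueMatcher
      method="regex">
    \bskip me\b
  </valueMatcher>
</handler>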
```
Since:

3.0.0

Author:

Pascal Essiembre

Constructor Summary

Constructors
Constructor and Description

DOMFilter()

Method Summary

Modifier and Type	Method and Description
`boolean`	`equals(Object other)`
`String`	`getExtract()` Gets what should be extracted for the value.
`TextMatcher`	`getFieldMatcher()` Gets this filter field matcher (copy).
`String`	`getParser()` Gets the parser to use when creating the DOM-tree.
`String`	`getSelector()`
`String`	`getSourceCharset()` Gets the assumed source character encoding.
`TextMatcher`	`getValueMatcher()` Gets this filter value matcher (copy).
`int`	`hashCode()`
`protected boolean`	`isDocumentMatched(HandlerDoc doc, InputStream input, ParseState parseState)`
`protected void`	`loadFilterFromXML(XML xml)`
`protected void`	`saveFilterToXML(XML xml)`
`void`	`setExtract(String extract)` Sets what should be extracted for the value.
`void`	`setFieldMatcher(TextMatcher fieldMatcher)` Sets this filter field matcher (copy).
`void`	`setParser(String parser)` Sets the parser to use when creating the DOM-tree.
`void`	`setSelector(String selector)`
`void`	`setSourceCharset(String sourceCharset)` Sets the assumed source character encoding.
`void`	`setValueMatcher(TextMatcher valueMatcher)` Sets this filter value matcher (copy).
`String`	`toString()`

Methods inherited from class com.norconex.importer.handler.filter.AbstractDocumentFilter
acceptDocument, getOnMatch, loadHandlerFromXML, saveHandlerToXML, setOnMatch

Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - DOMFilter
```
public DOMFilter()
```
- Method Detail
  - getSelector
```
public String getSelector()
```
  - setSelector
```
public void setSelector(String selector)
```
  - getFieldMatcher
```
public TextMatcher getFieldMatcher()
```
    Gets this filter field matcher (copy).
    
    Returns:
    
    field matcher
  - setFieldMatcher
```
public void setFieldMatcher(TextMatcher fieldMatcher)
```
    Sets this filter field matcher (copy).
    
    Parameters:
    
    fieldMatcher - field matcher
  - getValueMatcher
```
public TextMatcher getValueMatcher()
```
    Gets this filter value matcher (copy).
    
    Returns:
    
    value matcher
  - setValueMatcher
```
public void setValueMatcher(TextMatcher valueMatcher)
```
    Sets this filter value matcher (copy).
    
    Parameters:
    
    valueMatcher - value matcher
  - getExtract
```
public String getExtract()
```
    Gets what should be extracted for the value. See the class description for the supported values (e.g. "text", "html", "outerHtml"). null means this class will use the default ("text").
    
    Returns:
    
    what should be extracted for the value
  - setExtract
```
public void setExtract(String extract)
```
    Sets what should be extracted for the value. See the class description for the supported values (e.g. "text", "html", "outerHtml"). null means this class will use the default ("text").
    
    Parameters:
    
    extract - what should be extracted for the value
  - getSourceCharset
```
public String getSourceCharset()
```
    Gets the assumed source character encoding.
    
    Returns:
    
    character encoding of the source to be transformed
  - setSourceCharset
```
public void setSourceCharset(String sourceCharset)
```
    Sets the assumed source character encoding.
    
    Parameters:
    
    sourceCharset - character encoding of the source to be transformed
  - getParser
```
public String getParser()
```
    Gets the parser to use when creating the DOM-tree.
    
    Returns:
    
    html (default) or xml.
  - setParser
```
public void setParser(String parser)
```
    Sets the parser to use when creating the DOM-tree.
    
    Parameters:
    
    parser - html or xml.
  - isDocumentMatched
```
protected boolean isDocumentMatched(HandlerDoc doc,
                                    InputStream input,
                                    ParseState parseState)
                             throws ImporterHandlerException
```
    Specified by:
    
    isDocumentMatched in class AbstractDocumentFilter
    
    Throws:
    
    ImporterHandlerException
  - loadFilterFromXML
```
protected void loadFilterFromXML(XML xml)
```
    Specified by:
    
    loadFilterFromXML in class AbstractDocumentFilter
  - saveFilterToXML
```
protected void saveFilterToXML(XML xml)
```
    Specified by:
    
    saveFilterToXML in class AbstractDocumentFilter
  - equals
```
public boolean equals(Object other)
```
    Overrides:
    
    equals in class AbstractDocumentFilter
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class AbstractDocumentFilter
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class AbstractDocumentFilter
