DOMCondition (Norconex Importer 3.0.1 API)

java.lang.Object
- com.norconex.importer.handler.condition.AbstractCharStreamCondition
- - com.norconex.importer.handler.condition.impl.DOMCondition

All Implemented Interfaces:

IXMLConfigurable, IImporterCondition
```
public class DOMCondition
extends AbstractCharStreamCondition
```
A condition that uses a Document Object Model (DOM) representation of HTML, XHTML, or XML document content to match an element, attribute, or value.

To construct the DOM tree, the text is loaded entirely into memory. By default the DOM is created from the document content, but metadata fields can be used as the source instead. If more than one metadata field value is identified as the source of DOM content, only one needs to match for this condition to be true. Use this condition with caution if you expect to parse huge files. You can use TextFilter instead if this is a concern.

The jsoup parser library is used to load the content into a DOM tree. Elements are referenced using a CSS or jQuery-like selector syntax.

The use of a value matcher is optional. Without one, any element found by the provided DOM selector will constitute a match. If both a DOM selector and a value matcher are provided, the matching selector element value(s) will be retrieved and the value matcher will be applied against it (or them) for a match.
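
The selector-then-value logic described above can be sketched with the JDK alone. This is not how DOMCondition is implemented: it swaps in the JDK's XML parser and XPath in place of jsoup and CSS selectors, and uses a plain "find"-style regular expression in place of a TextMatcher. The class name `SelectorThenValueSketch` and its `matches` method are hypothetical, introduced only for illustration.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: XPath stands in for jsoup CSS selectors,
// and a regex Pattern stands in for the value matcher.
class SelectorThenValueSketch {

    static boolean matches(String xml, String xpath, Pattern valuePattern) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);

            // No value matcher: any selected element is a match.
            if (valuePattern == null) {
                return nodes.getLength() > 0;
            }
            // With a value matcher: one matching element value is enough.
            for (int i = 0; i < nodes.getLength(); i++) {
                String value = nodes.item(i).getTextContent();
                if (valuePattern.matcher(value).find()) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate DOM", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><body><p class=\"disclaimer\">please skip me now</p>"
                + "<p>keep</p></body></html>";
        System.out.println(matches(xml, "//p[@class='disclaimer']", null));
        System.out.println(matches(xml, "//p[@class='disclaimer']",
                Pattern.compile("\\bskip me\\b")));
    }
}
```

With a null pattern the selector alone decides the outcome; with a pattern, the extracted element text must also satisfy it.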

You can control exactly what gets extracted for matching purposes with the "extract" argument of setExtract(String). Possible values are:
- text: Default option when extract is blank. The text of the element, including combined children.
- html: Extracts an element's inner HTML (including children).
- outerHtml: Extracts an element's outer HTML (like "html", but includes the "current" tag).
- ownText: Extracts the text owned by this element only; does not get the combined text of all children.
- data: Extracts the combined data of a data-element (e.g. <script>).
- id: Extracts the ID attribute of the element (if any).
- tagName: Extracts the name of the tag of the element.
- val: Extracts the value of a form element (input, textarea, etc).
- className: Extracts the literal value of the element's "class" attribute, which may include multiple class names, space separated.
- cssSelector: Extracts a CSS selector that will uniquely select (identify) this element.
- attr(attributeKey): Extracts the value of the element attribute matching your replacement for "attributeKey" (e.g. "attr(title)" will extract the "title" attribute).
Should be used as a pre-parse handler.
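
As a rough illustration of the "extract" values listed above, the following sketch shows one way such an option string could be interpreted: blank falls back to "text", and the "attr(attributeKey)" form yields the attribute key. The class `ExtractOptionSketch`, its methods, and the regular expression are assumptions made for this example, not the library's own parsing code.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Norconex Importer.
class ExtractOptionSketch {

    // Assumed pattern for the "attr(attributeKey)" form.
    private static final Pattern ATTR = Pattern.compile("attr\\((.+)\\)");

    // Null or blank falls back to "text", the documented default.
    static String effectiveOption(String extract) {
        return (extract == null || extract.trim().isEmpty())
                ? "text" : extract.trim();
    }

    // Returns the attribute key when the option has the attr(...) form.
    static Optional<String> attributeKey(String extract) {
        Matcher m = ATTR.matcher(effectiveOption(extract));
        return m.matches()
                ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(effectiveOption(null));                 // prints "text"
        System.out.println(attributeKey("attr(title)")
                .orElse("(not an attribute)"));                    // prints "title"
        System.out.println(attributeKey("ownText")
                .orElse("(not an attribute)"));                    // prints "(not an attribute)"
    }
}
```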

Content-types

If you are dealing with multiple document types and you are using this condition on the document content, it is important to restrict this condition to text-based XML-like content only to prevent DOM-parsing errors.

By default this condition only applies to documents matching the content types listed in CommonMatchers.domContentTypes(). Other content types always make this condition false.

You can overwrite these default content types by providing your own content type matcher. Make sure the content types you use represent a file with HTML or XML-like markup tags.
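
The gating behavior described in this section can be pictured as follows. This is a minimal sketch under stated assumptions: the set of content types below is illustrative only and is not the actual list returned by CommonMatchers.domContentTypes(), and the class `ContentTypeGateSketch` with its methods is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of "other content types always make this
// condition false".
class ContentTypeGateSketch {

    // Illustrative list only; NOT the real domContentTypes() values.
    static final Set<String> ASSUMED_DOM_TYPES = new HashSet<>(Arrays.asList(
            "text/html", "application/xhtml+xml",
            "application/xml", "text/xml"));

    // Compares the base type, ignoring parameters such as "; charset=...".
    static boolean applies(String contentType) {
        if (contentType == null) {
            return false;
        }
        String base = contentType.split(";", 2)[0]
                .trim().toLowerCase(Locale.ROOT);
        return ASSUMED_DOM_TYPES.contains(base);
    }

    // The DOM test only runs when the content type qualifies.
    static boolean evaluate(String contentType, BooleanSupplier domTest) {
        return applies(contentType) && domTest.getAsBoolean();
    }

    public static void main(String[] args) {
        System.out.println(evaluate("text/html; charset=UTF-8", () -> true));
        System.out.println(evaluate("application/pdf", () -> true));
    }
}
```

Because the DOM test is passed as a supplier, it is never invoked for a non-matching content type, which mirrors the goal of avoiding DOM-parsing errors on non-markup content.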

Character encoding

When used as a pre-parse handler, this condition uses the detected or previously set content character encoding unless a character encoding was specified using AbstractCharStreamCondition.setSourceCharset(String). Since document parsing converts content to UTF-8, UTF-8 is always assumed when used as a post-parse handler.
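
The order of precedence described above could be sketched like this. The class `CharsetChoiceSketch` and its `resolve` method are hypothetical, and the final fallback to UTF-8 when nothing is specified or detected is an assumption of this example, not documented behavior.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the charset precedence rules.
class CharsetChoiceSketch {

    static Charset resolve(
            String sourceCharset, String detectedCharset, boolean postParse) {
        // Post-parse content is assumed to already be UTF-8.
        if (postParse) {
            return StandardCharsets.UTF_8;
        }
        // An explicitly configured source charset wins over detection.
        if (sourceCharset != null && !sourceCharset.trim().isEmpty()) {
            return Charset.forName(sourceCharset.trim());
        }
        if (detectedCharset != null && !detectedCharset.trim().isEmpty()) {
            return Charset.forName(detectedCharset.trim());
        }
        // Assumption for this sketch: fall back to UTF-8.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(resolve("UTF-16", "ISO-8859-1", false));
        System.out.println(resolve(null, "ISO-8859-1", false));
        System.out.println(resolve("UTF-16", "ISO-8859-1", true));
    }
}
```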

XML vs HTML

You can specify which DOM parser to use when reading documents. The default is "html", which tries to normalize/fix the content as HTML. This is generally the desired behavior, but it can sometimes cause your selector to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt to normalize the content. The drawback with "xml" is that some HTML-specific selector options may not work. If you know you are dealing with XML to begin with, specifying "xml" is a good option.

XML configuration usage:
```
<handler
    class="com.norconex.importer.handler.condition.impl.DOMCondition"
    sourceCharset="(character encoding)"
    selector="(selector syntax)"
    parser="[html|xml]"
    extract="[text|html|outerHtml|ownText|data|id|tagName|val|className|cssSelector|attr(attributeKey)]">
  <fieldMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (Optional expression matching one or more fields where the DOM text is
    located.)
  </fieldMatcher>
  <valueMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (Optional expression matching selector extracted value.)
  </valueMatcher>
  <contentTypeMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (Optional expression overwriting the content types this condition applies
    to.)
  </contentTypeMatcher>
</handler>
```
XML usage example:
```

<condition
    class="DOMCondition"
    selector="img[src$=.gif]"
    onMatch="exclude"/>

<condition
    class="DOMCondition"
    selector="p.disclaimer"
    onMatch="exclude">
  <valueMatcher
      method="regex">
    \bskip me\b
  </valueMatcher>
</condition>
```
Since:

3.0.0

Author:

Pascal Essiembre

Constructor Summary

Constructors
Constructor and Description

DOMCondition()

Method Summary

Modifier and Type	Method and Description
`boolean`	`equals(Object other)`
`TextMatcher`	`getContentTypeMatcher()` Gets this condition content-type matcher.
`String`	`getExtract()` Gets what should be extracted for the value.
`TextMatcher`	`getFieldMatcher()` Gets this condition field matcher.
`String`	`getParser()` Gets the parser to use when creating the DOM-tree.
`String`	`getSelector()`
`TextMatcher`	`getValueMatcher()` Gets this condition value matcher.
`int`	`hashCode()`
`protected void`	`loadCharStreamConditionFromXML(XML xml)` Loads configuration settings specific to the implementing class.
`protected void`	`saveCharStreamConditionToXML(XML xml)` Saves configuration settings specific to the implementing class.
`void`	`setContentTypeMatcher(TextMatcher contentTypeMatcher)` Sets this condition content-type matcher.
`void`	`setExtract(String extract)` Sets what should be extracted for the value.
`void`	`setFieldMatcher(TextMatcher fieldMatcher)` Sets this condition field matcher.
`void`	`setParser(String parser)` Sets the parser to use when creating the DOM-tree.
`void`	`setSelector(String selector)`
`void`	`setValueMatcher(TextMatcher valueMatcher)` Sets this condition value matcher.
`protected boolean`	`testDocument(HandlerDoc doc, Reader input, ParseState parseState)`
`String`	`toString()`

Methods inherited from class com.norconex.importer.handler.condition.AbstractCharStreamCondition
getSourceCharset, loadFromXML, saveToXML, setSourceCharset, testDocument

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - DOMCondition
```
public DOMCondition()
```
- Method Detail
  - getFieldMatcher
```
public TextMatcher getFieldMatcher()
```
    Gets this condition field matcher.
    
    Returns:
    
    field matcher
  - setFieldMatcher
```
public void setFieldMatcher(TextMatcher fieldMatcher)
```
    Sets this condition field matcher.
    
    Parameters:
    
    fieldMatcher - field matcher
  - getValueMatcher
```
public TextMatcher getValueMatcher()
```
    Gets this condition value matcher.
    
    Returns:
    
    value matcher
  - setValueMatcher
```
public void setValueMatcher(TextMatcher valueMatcher)
```
    Sets this condition value matcher.
    
    Parameters:
    
    valueMatcher - value matcher
  - getContentTypeMatcher
```
public TextMatcher getContentTypeMatcher()
```
    Gets this condition content-type matcher.
    
    Returns:
    
    content-type matcher
  - setContentTypeMatcher
```
public void setContentTypeMatcher(TextMatcher contentTypeMatcher)
```
    Sets this condition content-type matcher.
    
    Parameters:
    
    contentTypeMatcher - content-type matcher
  - getExtract
```
public String getExtract()
```
    Gets what should be extracted for the value, for example "text" (default), "html", or "outerHtml"; see the class description for all supported options. A null value means this class will use the default ("text").
    
    Returns:
    
    what should be extracted for the value
  - setExtract
```
public void setExtract(String extract)
```
    Sets what should be extracted for the value, for example "text" (default), "html", or "outerHtml"; see the class description for all supported options. A null value means this class will use the default ("text").
    
    Parameters:
    
    extract - what should be extracted for the value
  - getParser
```
public String getParser()
```
    Gets the parser to use when creating the DOM-tree.
    
    Returns:
    
    html (default) or xml.
  - setParser
```
public void setParser(String parser)
```
    Sets the parser to use when creating the DOM-tree.
    
    Parameters:
    
    parser - html or xml.
  - getSelector
```
public String getSelector()
```
  - setSelector
```
public void setSelector(String selector)
```
  - testDocument
```
protected boolean testDocument(HandlerDoc doc,
                               Reader input,
                               ParseState parseState)
                        throws ImporterHandlerException
```
    Specified by:
    
    testDocument in class AbstractCharStreamCondition
    
    Throws:
    
    ImporterHandlerException
  - loadCharStreamConditionFromXML
```
protected void loadCharStreamConditionFromXML(XML xml)
```
    Description copied from class: AbstractCharStreamCondition
    
    Loads configuration settings specific to the implementing class.
    
    Specified by:
    
    loadCharStreamConditionFromXML in class AbstractCharStreamCondition
    
    Parameters:
    
    xml - XML configuration
  - saveCharStreamConditionToXML
```
protected void saveCharStreamConditionToXML(XML xml)
```
    Description copied from class: AbstractCharStreamCondition
    
    Saves configuration settings specific to the implementing class. The parent tag along with the "class" attribute are already written.
    
    Specified by:
    
    saveCharStreamConditionToXML in class AbstractCharStreamCondition
    
    Parameters:
    
    xml - the XML
  - equals
```
public boolean equals(Object other)
```
    Overrides:
    
    equals in class AbstractCharStreamCondition
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class AbstractCharStreamCondition
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class AbstractCharStreamCondition
