DOMPreserveTransformer (Norconex Importer 3.0.1 API)

java.lang.Object
- com.norconex.importer.handler.AbstractImporterHandler
  - com.norconex.importer.handler.transformer.AbstractDocumentTransformer
    - com.norconex.importer.handler.transformer.impl.DOMPreserveTransformer

All Implemented Interfaces:

IXMLConfigurable, IImporterHandler, IDocumentTransformer
```
public class DOMPreserveTransformer
extends AbstractDocumentTransformer
```
Preserves only the elements matching one or more given selectors in the content of a document. Applies to HTML, XHTML, or XML documents. To store preserved values in fields, use DOMTagger instead.

This class constructs a DOM tree from the content of a document or field. That DOM tree is loaded entirely into memory, so use this transformer with caution if you expect to parse huge files.

The jsoup parser library is used to load the document content into a DOM tree. Elements are referenced using a CSS or jQuery-like selector syntax.

Should be used as a pre-parse handler.

Content-types

By default, this transformer applies only to documents matching the restrictions returned by CommonRestrictions.domContentTypes(String). You can specify your own content types if you know they represent files with HTML or XML-like markup tags.
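
As an illustration, a custom content-type restriction could look like the sketch below. It reuses the `restrictTo` syntax from the XML configuration usage on this page. The field name `document.contentType`, the media type `application/vnd.example+xml`, and the `article` selector are assumptions made for the example and are not taken from this page.

```xml
<handler
    class="com.norconex.importer.handler.transformer.impl.DOMPreserveTransformer"
    parser="xml">
  <!-- Assumption: the detected content type is held in a field named
       "document.contentType". The media type below is a made-up example. -->
  <restrictTo>
    <fieldMatcher>document.contentType</fieldMatcher>
    <valueMatcher>application/vnd.example+xml</valueMatcher>
  </restrictTo>
  <!-- Hypothetical selector: keep only the "article" elements. -->
  <dom selector="article" extract="outerHtml"/>
</handler>
```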

When used as a pre-parse handler, this class attempts to detect the content character encoding unless the character encoding was specified using setSourceCharset(String). Since document parsing converts content to UTF-8, UTF-8 is always assumed when used as a post-parse handler.
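
If detection is unreliable for your documents, the source encoding can be stated explicitly through the `sourceCharset` attribute. The following is a minimal sketch; the encoding `ISO-8859-1` and the `div.content` selector are illustrative values only.

```xml
<!-- Skip charset detection by stating the source encoding explicitly.
     "ISO-8859-1" and "div.content" are example values. -->
<handler
    class="com.norconex.importer.handler.transformer.impl.DOMPreserveTransformer"
    sourceCharset="ISO-8859-1">
  <dom selector="div.content" extract="html"/>
</handler>
```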

You can control exactly what gets preserved with the "extract" argument of DOMPreserveTransformer.DOMExtractDetails.setExtract(String). Possible values are:
- text: Default option when extract is blank. The text of the element, including combined children.
- html: Extracts an element's inner HTML (including children).
- outerHtml: Extracts an element's outer HTML (like "html", but includes the "current" tag).
- ownText: Extracts the text owned by this element only; does not get the combined text of all children.
- data: Extracts the combined data of a data-element (e.g. <script>).
- id: Extracts the ID attribute of the element (if any).
- tagName: Extracts the name of the tag of the element.
- val: Extracts the value of a form element (input, textarea, etc).
- className: Extracts the literal value of the element's "class" attribute, which may include multiple class names, space separated.
- cssSelector: Extracts a CSS selector that will uniquely select (identify) this element.
- attr(attributeKey): Extracts the value of the element attribute matching your replacement for "attributeKey" (e.g. "attr(title)" will extract the "title" attribute).

You can specify a defaultValue on each set of DOM extraction details. When a selector matches nothing, the default value is inserted in the modified document content. When matching blanks (see below), you get an empty string rather than the default value. Empty strings and spaces are supported as default values (the default value is now taken literally).
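
The sketch below combines two of the extract options above with a default value. The selectors `a.profile` and `h1` and the default text `Untitled` are hypothetical and chosen only for illustration.

```xml
<handler class="com.norconex.importer.handler.transformer.impl.DOMPreserveTransformer">
  <!-- Keeps the value of the "title" attribute of matching links. -->
  <dom selector="a.profile" extract="attr(title)"/>
  <!-- Keeps the text owned directly by h1 elements (not their children).
       "Untitled" is used when no h1 element matches. -->
  <dom selector="h1" extract="ownText" defaultValue="Untitled"/>
</handler>
```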

You can set matchBlanks to true to match elements that are present but have blank values. Blank values are empty values or values containing only whitespace. Because whitespace is normalized by the DOM parser, such matches always return an empty string (spaces are trimmed). By default, elements with blank values are not matched and are ignored.
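
A configuration using both settings might look like the following sketch. The selector `span.nickname` and the default text are hypothetical.

```xml
<handler class="com.norconex.importer.handler.transformer.impl.DOMPreserveTransformer">
  <!-- With matchBlanks="true", a span.nickname element that is present
       but blank counts as a match and yields an empty string.
       The default value applies only when no element matches at all. -->
  <dom selector="span.nickname"
       extract="text"
       matchBlanks="true"
       defaultValue="(no nickname)"/>
</handler>
```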

You can specify which parser to use when reading documents. The default is "html", which normalizes the content as HTML. This is generally the desired behavior, but it can sometimes cause your selector to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt to normalize the content. The drawback of "xml" is that some HTML-specific selector options may not work. If you know you are dealing with XML to begin with, specifying "xml" should be a good option.
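
For XML sources, the parser can be switched through the `parser` attribute, as in this sketch. The element names `record` and `title` are invented for the example.

```xml
<!-- parser="xml" leaves the markup as-is instead of normalizing it as HTML.
     "record" and "title" are hypothetical element names. -->
<handler
    class="com.norconex.importer.handler.transformer.impl.DOMPreserveTransformer"
    parser="xml">
  <dom selector="record > title" extract="text"/>
</handler>
```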

Multiple preserved elements

It is possible to preserve multiple elements or pieces of text by specifying multiple DOM selectors. Each potential match is always performed on the DOM as it was received, not on the result of a previous selector. You can combine this transformer with DOMDeleteTransformer for additional flexibility.
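
One possible combination is sketched below: a DOMDeleteTransformer first strips unwanted nested markup, then this transformer keeps two sections. This assumes both handlers are declared in this order in the same handler list and that DOMDeleteTransformer accepts a `dom` element with a `selector` attribute; check its own documentation before relying on that. All selectors are hypothetical.

```xml
<!-- Assumption: DOMDeleteTransformer is configured with <dom selector="..."/>. -->
<handler class="com.norconex.importer.handler.transformer.impl.DOMDeleteTransformer">
  <!-- Remove advertising blocks before anything is preserved. -->
  <dom selector="div.ad"/>
</handler>
<handler class="com.norconex.importer.handler.transformer.impl.DOMPreserveTransformer">
  <!-- Both selectors are evaluated against the DOM as received by this handler. -->
  <dom selector="div.summary" extract="outerHtml"/>
  <dom selector="div.body" extract="outerHtml"/>
</handler>
```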

Note that preserved elements and text may not always form valid XML when put back together. If your goal is to have the Importer parser extract the raw text from the result as it would for any other document, this is not an issue. It could be one if you want to use the new document content as XML in a different context.

XML configuration usage:
```
<handler
    class="com.norconex.importer.handler.transformer.impl.DOMPreserveTransformer"
    parser="[html|xml]"
    sourceCharset="(character encoding)">
  
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  
  <dom
      selector="(selector syntax)"
      extract="[text|html|outerHtml|ownText|data|id|tagName|val|className|cssSelector|attr(attributeKey)]"
      matchBlanks="[false|true]"
      defaultValue="(optional value to use when no match)"/>
</handler>
```
XML usage example:
```
<handler
    class="DOMPreserveTransformer">
  <dom
      selector="div.firstName"
      extract="outerHtml"/>
  <dom
      selector="div.lastName"
      extract="outerHtml"/>
</handler>
```
Given this HTML snippet...
```
<div>
  <div class="firstName">Joe</div>
  <div class="lastName">Dalton</div>
  <div class="city">Daisy Town</div>
</div>
```
... the above example will result in the document content having the following:
```
<div class="firstName">Joe</div>
<div class="lastName">Dalton</div>
```
Since:

3.0.1

Author:

Pascal Essiembre

See Also:

DOMTagger, DOMDeleteTransformer

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`DOMPreserveTransformer.DOMExtractDetails` DOM Extraction Details

Constructor Summary

Constructors
Constructor and Description
`DOMPreserveTransformer()` Constructor.

Method Summary

Modifier and Type	Method and Description
`void`	`addDOMExtractDetails(DOMPreserveTransformer.DOMExtractDetails extractDetails)` Adds DOM extraction details.
`boolean`	`equals(Object other)`
`List<DOMPreserveTransformer.DOMExtractDetails>`	`getDOMExtractDetailsList()` Gets a list of DOM extraction details.
`String`	`getParser()` Gets the parser to use when creating the DOM-tree.
`String`	`getSourceCharset()` Gets the assumed source character encoding.
`int`	`hashCode()`
`protected void`	`loadHandlerFromXML(XML xml)` Loads configuration settings specific to the implementing class.
`void`	`removeDOMExtractDetails(String selector)` Removes the DOM extraction details matching the given selector.
`void`	`removeDOMExtractDetailsList()` Removes all DOM extraction details.
`protected void`	`saveHandlerToXML(XML xml)` Saves configuration settings specific to the implementing class.
`void`	`setParser(String parser)` Sets the parser to use when creating the DOM-tree.
`void`	`setSourceCharset(String sourceCharset)` Sets the assumed source character encoding.
`String`	`toString()`
`protected void`	`transformApplicableDocument(HandlerDoc doc, InputStream document, OutputStream output, ParseState parseState)`

Methods inherited from class com.norconex.importer.handler.transformer.AbstractDocumentTransformer
transformDocument

Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - DOMPreserveTransformer
```
public DOMPreserveTransformer()
```
    Constructor.
- Method Detail
  - getSourceCharset
```
public String getSourceCharset()
```
    Gets the assumed source character encoding.
    
    Returns:
    
    character encoding of the source to be transformed
  - setSourceCharset
```
public void setSourceCharset(String sourceCharset)
```
    Sets the assumed source character encoding.
    
    Parameters:
    
    sourceCharset - character encoding of the source to be transformed
  - getParser
```
public String getParser()
```
    Gets the parser to use when creating the DOM-tree.
    
    Returns:
    
    html (default) or xml.
  - setParser
```
public void setParser(String parser)
```
    Sets the parser to use when creating the DOM-tree.
    
    Parameters:
    
    parser - html or xml.
  - transformApplicableDocument
```
protected void transformApplicableDocument(HandlerDoc doc,
                                           InputStream document,
                                           OutputStream output,
                                           ParseState parseState)
                                    throws ImporterHandlerException
```
    Specified by:
    
    transformApplicableDocument in class AbstractDocumentTransformer
    
    Throws:
    
    ImporterHandlerException
  - addDOMExtractDetails
```
public void addDOMExtractDetails(DOMPreserveTransformer.DOMExtractDetails extractDetails)
```
    Adds DOM extraction details.
    
    Parameters:
    
    extractDetails - DOM extraction details
  - getDOMExtractDetailsList
```
public List<DOMPreserveTransformer.DOMExtractDetails> getDOMExtractDetailsList()
```
    Gets a list of DOM extraction details.
    
    Returns:
    
    list of DOM extraction details.
  - removeDOMExtractDetails
```
public void removeDOMExtractDetails(String selector)
```
    Removes the DOM extraction details matching the given selector.
    
    Parameters:
    
    selector - DOM selector
  - removeDOMExtractDetailsList
```
public void removeDOMExtractDetailsList()
```
    Removes all DOM extraction details.
  - loadHandlerFromXML
```
protected void loadHandlerFromXML(XML xml)
```
    Description copied from class: AbstractImporterHandler
    
    Loads configuration settings specific to the implementing class.
    
    Specified by:
    
    loadHandlerFromXML in class AbstractImporterHandler
    
    Parameters:
    
    xml - XML configuration
  - saveHandlerToXML
```
protected void saveHandlerToXML(XML xml)
```
    Description copied from class: AbstractImporterHandler
    
    Saves configuration settings specific to the implementing class.
    
    Specified by:
    
    saveHandlerToXML in class AbstractImporterHandler
    
    Parameters:
    
    xml - the XML
  - equals
```
public boolean equals(Object other)
```
    Overrides:
    
    equals in class AbstractImporterHandler
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class AbstractImporterHandler
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class AbstractImporterHandler
