DOMDeleteTransformer (Norconex Importer 3.0.1 API)

java.lang.Object
- com.norconex.importer.handler.AbstractImporterHandler
  - com.norconex.importer.handler.transformer.AbstractDocumentTransformer
    - com.norconex.importer.handler.transformer.impl.DOMDeleteTransformer

All Implemented Interfaces:

IXMLConfigurable, IImporterHandler, IDocumentTransformer
```
public class DOMDeleteTransformer
extends AbstractDocumentTransformer
```
Deletes one or more elements matching a given selector from a document's content. Applies to HTML, XHTML, or XML documents. To deal with DOM elements in metadata fields, use DOMTagger instead.

This class constructs a DOM tree from the document content and loads that tree entirely into memory. Use this transformer with caution if you expect to parse huge files; ReplaceTransformer may be preferable if memory is a concern. To help performance and avoid re-creating the DOM tree before every DOM operation, try to combine multiple deletions in a single instance of this transformer.
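As a minimal sketch of combining several deletions in one instance, the configuration below declares multiple `<dom>` elements in a single handler so the DOM tree is built only once. The selectors are illustrative placeholders, and repeating `<dom>` is an assumption inferred from the list-based `setSelectors(List<String>)` and `addSelector(String)` methods.

```
<handler
    class="com.norconex.importer.handler.transformer.impl.DOMDeleteTransformer">
  <!-- Hypothetical selectors: each matching element is deleted. -->
  <dom
      selector="nav"/>
  <dom
      selector="div.advert"/>
  <dom
      selector="footer"/>
</handler>
```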

The jsoup parser library is used to load the document content into a DOM tree. Elements are referenced using a CSS or jQuery-like selector syntax.

Should be used as a pre-parse handler.

Content-types

By default, this transformer applies only to documents matching the restrictions returned by CommonRestrictions.domContentTypes(String). You can specify your own content types if you know they represent files with HTML or XML-like markup tags.
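A custom content-type restriction might look like the following sketch. It assumes the content type is stored in a metadata field named `document.contentType` and uses a made-up content type (`application/vnd.example+xml`); adjust both to match your documents.

```
<handler
    class="com.norconex.importer.handler.transformer.impl.DOMDeleteTransformer">
  <restrictTo>
    <!-- Assumed name of the metadata field holding the content type. -->
    <fieldMatcher
        method="basic">
      document.contentType
    </fieldMatcher>
    <!-- Hypothetical content type with XML-like markup. -->
    <valueMatcher
        method="basic">
      application/vnd.example+xml
    </valueMatcher>
  </restrictTo>
  <dom
      selector="div.firstName"/>
</handler>
```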

When used as a pre-parse handler, this class attempts to detect the content character encoding unless the character encoding was specified using setSourceCharset(String).

You can specify which parser to use when reading documents. The default is "html", which normalizes the content as HTML. This is generally the desired behavior, but it can sometimes cause your selector to fail. If you encounter this problem, try switching to the "xml" parser, which does not attempt to normalize the content. The drawback of "xml" is that some HTML-specific selector options may not work. If you know you are dealing with XML to begin with, specifying "xml" should be a good option.
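For content known to be XML, a configuration along these lines selects the "xml" parser and states the source encoding explicitly so that no detection is attempted. The encoding value and the selector are illustrative assumptions, not values taken from this documentation.

```
<handler
    class="com.norconex.importer.handler.transformer.impl.DOMDeleteTransformer"
    parser="xml"
    sourceCharset="ISO-8859-1">
  <!-- Hypothetical selector: deletes "draft" elements that are direct children of "entry". -->
  <dom
      selector="entry > draft"/>
</handler>
```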

XML configuration usage:
```
<handler
    class="com.norconex.importer.handler.transformer.impl.DOMDeleteTransformer"
    parser="[html|xml]"
    sourceCharset="(character encoding)">
  
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  
  <dom
      selector="(selector syntax)"/>
</handler>
```
XML usage example:
```
<handler
    class="DOMDeleteTransformer">
  <dom
      selector="div.firstName"/>
</handler>
```
Given this HTML snippet...
```
<div class="firstName">Joe</div>
<div class="lastName">Dalton</div>
```
... the above example will delete the first div element (containing "Joe") but keep the second (containing "Dalton").
Since:

3.0.0

Author:

Pascal Essiembre

See Also:

DOMTagger, DOMPreserveTransformer

Constructor Summary

Constructors
Constructor and Description

DOMDeleteTransformer()
Constructor.


Method Summary

| Modifier and Type | Method and Description |
|---|---|
| `void` | `addSelector(String selector)` |
| `boolean` | `equals(Object other)` |
| `String` | `getParser()` Gets the parser to use when creating the DOM-tree. |
| `List<String>` | `getSelectors()` |
| `String` | `getSourceCharset()` Gets the assumed source character encoding. |
| `int` | `hashCode()` |
| `protected void` | `loadHandlerFromXML(XML xml)` Loads configuration settings specific to the implementing class. |
| `protected void` | `saveHandlerToXML(XML xml)` Saves configuration settings specific to the implementing class. |
| `void` | `setParser(String parser)` Sets the parser to use when creating the DOM-tree. |
| `void` | `setSelectors(List<String> selectors)` |
| `void` | `setSourceCharset(String sourceCharset)` Sets the assumed source character encoding. |
| `String` | `toString()` |
| `protected void` | `transformApplicableDocument(HandlerDoc doc, InputStream document, OutputStream output, ParseState parseState)` |

Methods inherited from class com.norconex.importer.handler.transformer.AbstractDocumentTransformer
transformDocument

Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - DOMDeleteTransformer
```
public DOMDeleteTransformer()
```
    Constructor.
- Method Detail
  - getSourceCharset
```
public String getSourceCharset()
```
    Gets the assumed source character encoding.
    
    Returns:
    
    character encoding of the source to be transformed
  - setSourceCharset
```
public void setSourceCharset(String sourceCharset)
```
    Sets the assumed source character encoding.
    
    Parameters:
    
    sourceCharset - character encoding of the source to be transformed
  - getParser
```
public String getParser()
```
    Gets the parser to use when creating the DOM-tree.
    
    Returns:
    
    html (default) or xml.
  - setParser
```
public void setParser(String parser)
```
    Sets the parser to use when creating the DOM-tree.
    
    Parameters:
    
    parser - html or xml.
  - transformApplicableDocument
```
protected void transformApplicableDocument(HandlerDoc doc,
                                           InputStream document,
                                           OutputStream output,
                                           ParseState parseState)
                                    throws ImporterHandlerException
```
    Specified by:
    
    transformApplicableDocument in class AbstractDocumentTransformer
    
    Throws:
    
    ImporterHandlerException
  - getSelectors
```
public List<String> getSelectors()
```
  - setSelectors
```
public void setSelectors(List<String> selectors)
```
  - addSelector
```
public void addSelector(String selector)
```
  - loadHandlerFromXML
```
protected void loadHandlerFromXML(XML xml)
```
    Description copied from class: AbstractImporterHandler
    
    Loads configuration settings specific to the implementing class.
    
    Specified by:
    
    loadHandlerFromXML in class AbstractImporterHandler
    
    Parameters:
    
    xml - XML configuration
  - saveHandlerToXML
```
protected void saveHandlerToXML(XML xml)
```
    Description copied from class: AbstractImporterHandler
    
    Saves configuration settings specific to the implementing class.
    
    Specified by:
    
    saveHandlerToXML in class AbstractImporterHandler
    
    Parameters:
    
    xml - the XML
  - equals
```
public boolean equals(Object other)
```
    Overrides:
    
    equals in class AbstractImporterHandler
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class AbstractImporterHandler
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class AbstractImporterHandler
