java.lang.Object
- com.norconex.importer.parser.GenericDocumentParserFactory

All Implemented Interfaces:

IXMLConfigurable, IDocumentParserFactory
```
public class GenericDocumentParserFactory
extends Object
implements IDocumentParserFactory, IXMLConfigurable
```
Generic document parser factory. It uses Apache Tika for most of its supported content types. For unknown content types, it falls back to the Tika generic media detector/parser.

As of 2.6.0, it is possible to register your own parsers.

Ignoring content types:

You can "ignore" content-types so they do not get parsed. Unparsed documents will be sent as is to the post handlers and the calling application. Use caution when using that feature since post-parsing handlers (or applications) usually expect text-only content for them to execute properly. Unless you really know what you are doing, avoid excluding binary content types from parsing.
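As an illustration of how one regular expression can select the content types to skip, here is a minimal, library-independent sketch. The class `IgnoreRegexSketch`, its `isIgnored` helper, and the sample pattern are hypothetical, and the sketch assumes the expression must match the whole content type; verify the importer's actual matching behavior before relying on that.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only, not part of the Norconex API.
class IgnoreRegexSketch {

    // Returns true when the content type fully matches the ignore regex.
    // A null or blank regex ignores nothing.
    static boolean isIgnored(String ignoreRegex, String contentType) {
        if (ignoreRegex == null || ignoreRegex.trim().isEmpty()) {
            return false;
        }
        return Pattern.matches(ignoreRegex, contentType);
    }

    public static void main(String[] args) {
        String regex = "text/(plain|csv)"; // sample pattern (an assumption)
        System.out.println(isIgnored(regex, "text/plain"));      // true
        System.out.println(isIgnored(regex, "application/pdf")); // false
    }
}
```

The sample pattern only skips text-based types, in line with the advice above to keep binary content types parsed.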

Character encoding:

Parsing a document also attempts to detect the character encoding (charset) of the extracted text and convert it to UTF-8. When ignoring content types, the character encoding conversion to UTF-8 cannot take place and your documents will likely retain their original encoding.
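The conversion step itself can be pictured with plain JDK classes. This is only a sketch of re-encoding text once a charset has been detected; the class `Utf8ConversionSketch` and its `toUtf8` helper are hypothetical, and the detection step (which the parser performs for you) is assumed to have already produced the source charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only, not part of the Norconex API.
class Utf8ConversionSketch {

    // Decodes the bytes with the detected charset, then re-encodes as UTF-8.
    static byte[] toUtf8(byte[] input, Charset detected) {
        String text = new String(input, detected);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is one byte in ISO-8859-1
        // and two bytes in UTF-8.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(latin1.length + " -> " + utf8.length); // 4 -> 5
    }
}
```

A document whose content type is ignored never reaches this step, which is why its bytes stay in the original encoding.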

Embedded documents:

For documents containing embedded documents (e.g. zip files), the default behavior of this parser is to treat them as a single document, merging the content and metadata of all embedded documents into the parent document. You can tell this parser to "split" embedded documents so they are treated as individual documents. When split, each embedded document goes through the entire import cycle, passing through your handlers and even this parser again (just like any regular document would). The resulting ImporterResponse should then contain nested documents, which in turn might contain nested documents of their own (a tree-like structure). As of 2.6.0, splitting is enabled by specifying a regular expression matching the content types of the container documents you want to "split".

In addition, since 2.6.0 you can control which embedded documents are not extracted from their containers, as well as which container documents do not have their embedded documents extracted.
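The tree-like result of splitting can be modeled as follows. This is a sketch of the shape only: `NestedResponseSketch` is a hypothetical stand-in and does not reproduce the real ImporterResponse class or its methods.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only, not the real ImporterResponse.
class NestedResponseSketch {
    final String reference;
    final List<NestedResponseSketch> nested = new ArrayList<>();

    NestedResponseSketch(String reference) {
        this.reference = reference;
    }

    // Adds a nested (embedded) document and returns this node for chaining.
    NestedResponseSketch add(NestedResponseSketch child) {
        nested.add(child);
        return this;
    }

    // Counts this document plus all documents nested under it, at any depth.
    int countDocuments() {
        int count = 1;
        for (NestedResponseSketch child : nested) {
            count += child.countDocuments();
        }
        return count;
    }

    public static void main(String[] args) {
        // outer.zip contains a.txt and inner.zip; inner.zip contains b.txt.
        NestedResponseSketch inner = new NestedResponseSketch("inner.zip")
                .add(new NestedResponseSketch("b.txt"));
        NestedResponseSketch outer = new NestedResponseSketch("outer.zip")
                .add(new NestedResponseSketch("a.txt"))
                .add(inner);
        System.out.println(outer.countDocuments()); // 4
    }
}
```

Without splitting, the same input would yield a single document carrying the merged content and metadata.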

Optical character recognition (OCR):

You can configure this parser to use the Tesseract open-source OCR application to extract text out of images or documents containing embedded images (e.g. PDF). Supported image formats are TIFF, PNG, JPEG, GIF, and BMP.

To enable this feature, you must first download and install a copy of Tesseract appropriate for your platform (Linux, Windows, Mac and other platforms are supported). OCR is only activated once you configure the path to its install location. By default, only English is recognized. To support additional or different languages, you can provide a list of three-letter ISO language codes supported by Tesseract. These languages must be part of your Tesseract installation. You can download additional languages from the Tesseract web site.

When enabled, OCR is attempted on all supported image formats. To limit OCR to a subset of document content types, configure the corresponding content-types (e.g. application/pdf, image/tiff, image/png, etc.).

XML configuration usage:
```
<documentParserFactory
    class="com.norconex.importer.parser.GenericDocumentParserFactory">
  <ocr
      path="(path to Tesseract OCR software executable)">
    <languages>(optional comma-separated list of Tesseract languages)</languages>
    <contentTypes>
      (optional regex matching content types to limit OCR on)
    </contentTypes>
  </ocr>
  <ignoredContentTypes>
    (optional regex matching content types to ignore for parsing,
     i.e., not parsed)
  </ignoredContentTypes>
  <embedded>
    <splitContentTypes>
      (optional regex matching content types of containing files
       you want to "split" and have their embedded documents
       treated as individual documents)
    </splitContentTypes>
    <noExtractEmbeddedContentTypes>
      (optional regex matching content types of embedded files you do
       not want to extract from containing documents, regardless of
       the container content type)
    </noExtractEmbeddedContentTypes>
    <noExtractContainerContentTypes>
      (optional regex matching content types of containing files you
       do not want to see their embedded files extracted, regardless
       of the embedded content types)
    </noExtractContainerContentTypes>
  </embedded>
  <fallbackParser
      class="(optionally overwrite the fallback parser)"/>
  <parsers>
    <parser
        contentType="(content type)"
        class="(IDocumentParser implementing class)"/>
  </parsers>
</documentParserFactory>
```
Usage example:

The following uses Tesseract to convert images containing English and French text in PDF files into text. It also extracts documents from Zip files and treats them as separate documents.

XML usage example:
```
<documentParserFactory>
  <ocr
      path="/app/ocr/tesseract.exe">
    <languages>eng, fra</languages>
    <contentTypes>application/pdf</contentTypes>
  </ocr>
  <embedded>
    <splitContentTypes>application/zip</splitContentTypes>
  </embedded>
</documentParserFactory>
```
Author:

Pascal Essiembre

Constructor Summary

| Constructor | Description |
| --- | --- |
| `GenericDocumentParserFactory()` | Creates a new document parser factory of the given format. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| `boolean` | `equals(Object other)` | |
| `String` | `getIgnoredContentTypesRegex()` | Gets the regular expression matching content types to ignore (i.e. do not perform parsing on them). |
| `ParseHints` | `getParseHints()` | Gets parse hints. |
| `IDocumentParser` | `getParser(String documentReference, ContentType contentType)` | Gets a parser based on content type, regardless of document reference (ignoring it). |
| `int` | `hashCode()` | |
| `protected void` | `initDefaultParsers()` | |
| `void` | `loadFromXML(XML xml)` | |
| `void` | `registerParser(ContentType contentType, IDocumentParser parser)` | Registers a parser to use for the given content type. |
| `void` | `saveToXML(XML xml)` | |
| `void` | `setIgnoredContentTypesRegex(String ignoredContentTypesRegex)` | Sets the regular expression matching content types to ignore (i.e. do not perform parsing on them). |
| `String` | `toString()` | |

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - GenericDocumentParserFactory
```
public GenericDocumentParserFactory()
```
    Creates a new document parser factory of the given format.
- Method Detail
  - initDefaultParsers
```
protected void initDefaultParsers()
```
  - getParseHints
```
public ParseHints getParseHints()
```
    Gets parse hints.
    
    Returns:
    
    parse hints
    
    Since:
    
    2.6.0
  - registerParser
```
public void registerParser(ContentType contentType,
                           IDocumentParser parser)
```
    Registers a parser to use for the given content type. The provided parser will never be used if the content type is ignored by getIgnoredContentTypesRegex().
    
    Parameters:
    
    contentType - content type
    
    parser - parser
    
    Since:
    
    2.6.0
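    The precedence described above (an ignored content type wins over a registered parser) can be sketched without the library. Everything in this block is hypothetical: `ParserRegistrySketch` stands in for the factory, parsers are represented by plain names, and returning `null` for an ignored type as well as falling back for unregistered types are assumptions of the sketch, not confirmed behavior of this class.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for a parser factory; parsers are plain names here.
class ParserRegistrySketch {
    private final Map<String, String> parsers = new HashMap<>();
    private final String fallbackParser;
    private String ignoredRegex;

    ParserRegistrySketch(String fallbackParser) {
        this.fallbackParser = fallbackParser;
    }

    // Associates a parser with an exact content type, replacing any previous one.
    void registerParser(String contentType, String parser) {
        parsers.put(contentType, parser);
    }

    void setIgnoredContentTypesRegex(String regex) {
        this.ignoredRegex = regex;
    }

    // Returns null for ignored types (assumption of this sketch),
    // else the registered parser, else the fallback parser.
    String getParser(String contentType) {
        if (ignoredRegex != null && Pattern.matches(ignoredRegex, contentType)) {
            return null;
        }
        return parsers.getOrDefault(contentType, fallbackParser);
    }
}
```

    With this model, registering a parser for `application/pdf` and then setting the ignore expression to `application/pdf` leaves that parser unused, which mirrors the warning in the description.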
  - getParser
```
public final IDocumentParser getParser(String documentReference,
                                       ContentType contentType)
```
    Gets a parser based on content type, regardless of document reference (ignoring it). All parsers are assumed to have been configured properly before the first call to this method.
    
    Specified by:
    
    getParser in interface IDocumentParserFactory
    
    Parameters:
    
    documentReference - document reference
    
    contentType - content type
    
    Returns:
    
    document parser
  - getIgnoredContentTypesRegex
```
public String getIgnoredContentTypesRegex()
```
    Gets the regular expression matching content types to ignore (i.e. do not perform parsing on them).
    
    Returns:
    
    regular expression
  - setIgnoredContentTypesRegex
```
public void setIgnoredContentTypesRegex(String ignoredContentTypesRegex)
```
    Sets the regular expression matching content types to ignore (i.e. do not perform parsing on them).
    
    Parameters:
    
    ignoredContentTypesRegex - regular expression
  - loadFromXML
```
public void loadFromXML(XML xml)
```
    Specified by:
    
    loadFromXML in interface IXMLConfigurable
  - saveToXML
```
public void saveToXML(XML xml)
```
    Specified by:
    
    saveToXML in interface IXMLConfigurable
  - equals
```
public boolean equals(Object other)
```
    Overrides:
    
    equals in class Object
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class Object
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class Object
