Class GenericDocumentParserFactory

  • All Implemented Interfaces:
    IXMLConfigurable, IDocumentParserFactory

    public class GenericDocumentParserFactory
    extends Object
    implements IDocumentParserFactory, IXMLConfigurable

    Generic document parser factory. It uses Apache Tika for most of its supported content types. For unknown content types, it falls back to the Tika generic media detector/parser.

    As of 2.6.0, it is possible to register your own parsers.
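
    The lookup behavior described above (a parser per content type, a generic fallback for unknown types, and the ability to register your own parser) can be pictured with a minimal sketch. The SketchParser and ParserRegistrySketch types below are hypothetical stand-ins written for illustration only; they are not part of the library and do not claim to mirror its internals.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a parser; NOT the library's IDocumentParser.
interface SketchParser {
    String name();
}

// Illustrative registry: exact content-type lookup with a generic fallback.
class ParserRegistrySketch {
    private final Map<String, SketchParser> parsers = new HashMap<>();
    private final SketchParser fallback;

    ParserRegistrySketch(SketchParser fallback) {
        this.fallback = fallback;
    }

    // Registering a parser replaces any parser previously mapped to that type.
    void register(String contentType, SketchParser parser) {
        parsers.put(contentType, parser);
    }

    // Content types without a registered parser resolve to the fallback.
    SketchParser lookup(String contentType) {
        return parsers.getOrDefault(contentType, fallback);
    }
}
```

    In this sketch, registering a second parser for the same content type simply replaces the first, which is one plausible reading of "overwriting" a default parser.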

    Ignoring content types:

    You can "ignore" content-types so they do not get parsed. Unparsed documents will be sent as is to the post handlers and the calling application. Use caution when using that feature since post-parsing handlers (or applications) usually expect text-only content for them to execute properly. Unless you really know what you are doing, avoid excluding binary content types from parsing.
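
    To make the effect of an "ignore" expression concrete, here is a minimal sketch of how a regular expression can decide whether a content type is skipped. The IgnoredContentTypesSketch class is hypothetical, and it assumes the expression must match the entire content type string; the library's actual matching rules may differ.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the library.
class IgnoredContentTypesSketch {

    // Assumption: the regex must match the whole content type string.
    static boolean isIgnored(String ignoredRegex, String contentType) {
        if (ignoredRegex == null || ignoredRegex.trim().isEmpty()) {
            return false; // nothing configured: every document gets parsed
        }
        return Pattern.matches(ignoredRegex, contentType);
    }
}
```

    Under that assumption, an expression such as text/(plain|csv) would skip parsing for text/plain and text/csv while leaving application/pdf untouched.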

    Character encoding:

    Parsing a document also attempts to detect the character encoding (charset) of the extracted text in order to convert it to UTF-8. When ignoring content-types, the character encoding conversion to UTF-8 cannot take place and your documents will likely retain their original encoding.

    Embedded documents:

    For documents containing embedded documents (e.g. zip files), the default behavior of this parser is to treat them as a single document, merging the content and metadata of all embedded documents into the parent document. You can tell this parser to "split" embedded documents to have them treated as if they were individual documents. When split, each embedded document will go through the entire import cycle, going through your handlers and even this parser again (just like any regular document would). The resulting ImporterResponse should then contain nested documents, which in turn may contain nested documents of their own (a tree-like structure). As of 2.6.0, this is enabled by specifying a regular expression to match content types of container documents you want to "split".

    In addition, since 2.6.0 you can control which embedded documents are not extracted from their containers, as well as which container documents do not have their embedded documents extracted.
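
    The tree-like result of splitting can be illustrated with a small sketch. NestedDocSketch is a hypothetical node type invented for this example (it is not ImporterResponse); it only shows how a container holding another container yields nested levels that a caller may need to walk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type for illustration; NOT the library's ImporterResponse.
class NestedDocSketch {
    final String reference;
    final List<NestedDocSketch> nested = new ArrayList<>();

    NestedDocSketch(String reference) {
        this.reference = reference;
    }

    // Attaches an embedded document and returns this node for chaining.
    NestedDocSketch add(NestedDocSketch child) {
        nested.add(child);
        return this;
    }

    // Depth-first walk listing this document and everything nested under it.
    static List<String> flatten(NestedDocSketch root) {
        List<String> refs = new ArrayList<>();
        collect(root, refs);
        return refs;
    }

    private static void collect(NestedDocSketch doc, List<String> out) {
        out.add(doc.reference);
        for (NestedDocSketch child : doc.nested) {
            collect(child, out);
        }
    }
}
```

    For instance, a zip file containing a PDF and a second zip file would produce a root node with two children, the second of which has children of its own.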

    Optical character recognition (OCR):

    You can configure this parser to use the Tesseract open-source OCR application to extract text out of images or documents containing embedded images (e.g. PDF). Supported image formats are TIFF, PNG, JPEG, GIF, and BMP.

    To enable this feature, you must first download and install a copy of Tesseract appropriate for your platform (Linux, Windows, Mac and other platforms are supported). OCR is only activated once you configure the path to the Tesseract install location. The default OCR language is English. To support additional or different languages, you can provide a list of three-letter ISO language codes supported by Tesseract. These languages must be part of your Tesseract installation. You can download additional languages from the Tesseract web site.

    When enabled, OCR is attempted on all supported image formats. To limit OCR to a subset of document content types, configure the corresponding content-types (e.g. application/pdf, image/tiff, image/png, etc.).
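
    The two OCR rules stated above (nothing happens until a Tesseract path is configured, and an optional content-type expression narrows what gets OCR-ed) can be combined in a short sketch. OcrDecisionSketch is hypothetical, assumes whole-string regex matching, and for brevity does not check the list of supported image formats.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the library.
class OcrDecisionSketch {

    static boolean shouldOcr(
            String tesseractPath, String contentTypesRegex, String contentType) {
        // OCR stays inactive until a Tesseract path is configured.
        if (tesseractPath == null || tesseractPath.trim().isEmpty()) {
            return false;
        }
        // No restriction configured: OCR is attempted (simplification: the
        // supported image format check is left out of this sketch).
        if (contentTypesRegex == null || contentTypesRegex.trim().isEmpty()) {
            return true;
        }
        // Assumption: the regex must match the whole content type string.
        return Pattern.matches(contentTypesRegex, contentType);
    }
}
```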

    XML configuration usage:

    
    <documentParserFactory
        class="com.norconex.importer.parser.GenericDocumentParserFactory">
      <ocr
          path="(path to Tesseract OCR software executable)">
        <languages>(optional comma-separated list of Tesseract languages)</languages>
        <contentTypes>
          (optional regex matching content types to limit OCR on)
        </contentTypes>
      </ocr>
      <ignoredContentTypes>
        (optional regex matching content types to ignore for parsing,
         i.e., not parsed)
      </ignoredContentTypes>
      <embedded>
        <splitContentTypes>
          (optional regex matching content types of containing files
           you want to "split" and have their embedded documents
           treated as individual documents)
        </splitContentTypes>
        <noExtractEmbeddedContentTypes>
          (optional regex matching content types of embedded files you do
           not want to extract from containing documents, regardless of
           the container content type)
        </noExtractEmbeddedContentTypes>
        <noExtractContainerContentTypes>
          (optional regex matching content types of containing files
           whose embedded files you do not want extracted, regardless
           of the embedded content types)
        </noExtractContainerContentTypes>
      </embedded>
      <fallbackParser
          class="(optionally overwrite the fallback parser)"/>
      <parsers>
        <!--
          Optionally overwrite default parsers.
          You can configure many parsers.
          -->
        <parser
            contentType="(content type)"
            class="(IDocumentParser implementing class)"/>
      </parsers>
    </documentParserFactory>

    Usage example:

    The following uses Tesseract to extract English and French text from images found in PDF documents. It also extracts documents from Zip files and treats them as separate documents.

    XML usage example:

    
    <documentParserFactory>
      <ocr
          path="/app/ocr/tesseract.exe">
        <languages>eng, fra</languages>
        <contentTypes>application/pdf</contentTypes>
      </ocr>
      <embedded>
        <splitContentTypes>application/zip</splitContentTypes>
      </embedded>
    </documentParserFactory>
    Author:
    Pascal Essiembre
    • Constructor Detail

      • GenericDocumentParserFactory

        public GenericDocumentParserFactory()
        Creates a new document parser factory.
    • Method Detail

      • initDefaultParsers

        protected void initDefaultParsers()
      • getParseHints

        public ParseHints getParseHints()
        Gets parse hints.
        Returns:
        parse hints
        Since:
        2.6.0
      • registerParser

        public void registerParser​(ContentType contentType,
                                   IDocumentParser parser)
        Registers a parser to use for the given content type. The provided parser will never be used if the content type is ignored by getIgnoredContentTypesRegex().
        Parameters:
        contentType - content type
        parser - parser
        Since:
        2.6.0
      • getParser

        public final IDocumentParser getParser​(String documentReference,
                                               ContentType contentType)
        Gets a parser based on content type only; the document reference is ignored. All parsers are assumed to have been configured properly before the first call to this method.
        Specified by:
        getParser in interface IDocumentParserFactory
        Parameters:
        documentReference - document reference
        contentType - content type
        Returns:
        document parser
      • getIgnoredContentTypesRegex

        public String getIgnoredContentTypesRegex()
        Gets the regular expression matching content types to ignore (i.e. do not perform parsing on them).
        Returns:
        regular expression
      • setIgnoredContentTypesRegex

        public void setIgnoredContentTypesRegex​(String ignoredContentTypesRegex)
        Sets the regular expression matching content types to ignore (i.e. do not perform parsing on them).
        Parameters:
        ignoredContentTypesRegex - regular expression
      • hashCode

        public int hashCode()
        Overrides:
        hashCode in class Object