public class GenericDocumentParserFactory extends Object implements IDocumentParserFactory, IXMLConfigurable
Generic document parser factory. It uses Apache Tika for most of its supported content types. For unknown content types, it falls back to the Tika generic media detector/parser.
As of 2.6.0, it is possible to register your own parsers.
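As a rough illustration of registering a custom parser through configuration, the fragment below uses the `<parsers>` element from the configuration template on this page. The content type `application/x-acme` and the class `com.example.AcmeDocumentParser` are hypothetical placeholders, not part of the library; the class is assumed to implement IDocumentParser.

```xml
<documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
  <parsers>
    <!-- Hypothetical content type and parser class, for illustration only.
         The class is assumed to implement IDocumentParser. -->
    <parser contentType="application/x-acme"
            class="com.example.AcmeDocumentParser" />
  </parsers>
</documentParserFactory>
```

The same registration can presumably be done in code with registerParser(ContentType, IDocumentParser), listed in the method summary.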
You can "ignore" content-types so they do not get parsed. Unparsed documents will be sent as is to the post handlers and the calling application. Use caution when using that feature since post-parsing handlers (or applications) usually expect text-only content for them to execute properly. Unless you really know what you are doing, avoid excluding binary content types from parsing.
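A minimal sketch of ignoring content types, using the `<ignoredContentTypes>` element from the configuration template on this page. The regular expression is only an example; it assumes the expression is matched against the whole content type string, and it deliberately targets text-based types, in line with the caution above about binary content.

```xml
<documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
  <!-- Example regex (an assumption): leave plain text and CSV unparsed.
       These documents keep their original character encoding. -->
  <ignoredContentTypes>text/(plain|csv)</ignoredContentTypes>
</documentParserFactory>
```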
Parsing a document also attempts to detect the character encoding (charset) of the extracted text in order to convert it to UTF-8. When ignoring content-types, the character encoding conversion to UTF-8 cannot take place and your documents will likely retain their original encoding.
For documents containing embedded documents (e.g. zip files), the default behavior of this parser is to treat them as a single document, merging the content and metadata of all embedded documents into the parent document. You can tell this parser to "split" embedded documents to have them treated as if they were individual documents. When split, each embedded document will go through the entire import cycle, going through your handlers and even this parser again (just like any regular document would). The resulting ImporterResponse should then contain nested documents, which in turn may contain nested documents of their own (a tree-like structure). As of 2.6.0, this is enabled by specifying a regular expression to match content types of container documents you want to "split".
In addition, since 2.6.0 you can control which embedded documents should not be extracted from their containers, as well as which container documents should not have their embedded documents extracted.
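The two exclusion options might be combined as in the sketch below, which relies on the `<embedded>` elements from the configuration template on this page. The regular expressions are illustrative assumptions: the first skips embedded images whatever their container, and the second prevents any extraction from PDF containers.

```xml
<documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
  <embedded>
    <!-- Example regex: do not extract embedded images,
         regardless of the container content type. -->
    <noExtractEmbeddedContentTypes>image/.*</noExtractEmbeddedContentTypes>
    <!-- Example regex: do not extract anything embedded in PDF files,
         regardless of the embedded content types. -->
    <noExtractContainerContentTypes>application/pdf</noExtractContainerContentTypes>
  </embedded>
</documentParserFactory>
```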
You can configure this parser to use the Tesseract open-source OCR application to extract text out of images or documents containing embedded images (e.g. PDF). Supported image formats are TIFF, PNG, JPEG, GIF, and BMP.
To enable this feature, you must first download and install a copy of Tesseract appropriate for your platform (Linux, Windows, Mac, and other platforms are supported). It will only be activated once you configure the path to its install location. Default language detection is for English. To support additional or different languages, you can provide a list of three-letter ISO language codes supported by Tesseract. These languages must be part of your Tesseract installation. You can download additional languages from the Tesseract web site.
When enabled, OCR is attempted on all supported image formats. To limit OCR to a subset of document content types, configure the corresponding content-types (e.g. application/pdf, image/tiff, image/png, etc.).
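As a small sketch of limiting OCR to a subset of content types, the fragment below restricts it to TIFF and PNG images. The executable path `/usr/bin/tesseract` is an assumed install location and the regular expression is only an example; `<languages>` is omitted so that the default (English) applies.

```xml
<documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
  <!-- Assumed install location; adjust to your own Tesseract path. -->
  <ocr path="/usr/bin/tesseract">
    <!-- Example regex: only attempt OCR on TIFF and PNG images. -->
    <contentTypes>image/(tiff|png)</contentTypes>
  </ocr>
</documentParserFactory>
```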
<documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
  <ocr path="(path to Tesseract OCR software executable)">
    <languages>
      (optional comma-separated list of Tesseract languages)
    </languages>
    <contentTypes>
      (optional regex matching content types to limit OCR on)
    </contentTypes>
  </ocr>
  <ignoredContentTypes>
    (optional regex matching content types to ignore for parsing, i.e., not parsed)
  </ignoredContentTypes>
  <embedded>
    <splitContentTypes>
      (optional regex matching content types of containing files you want to "split" and have their embedded documents treated as individual documents)
    </splitContentTypes>
    <noExtractEmbeddedContentTypes>
      (optional regex matching content types of embedded files you do not want to extract from containing documents, regardless of the container content type)
    </noExtractEmbeddedContentTypes>
    <noExtractContainerContentTypes>
      (optional regex matching content types of containing files whose embedded files you do not want extracted, regardless of the embedded content types)
    </noExtractContainerContentTypes>
  </embedded>
  <fallbackParser class="(optionally overwrite the fallback parser)" />
  <parsers>
    <!-- Optionally overwrite default parsers. You can configure many parsers. -->
    <parser contentType="(content type)" class="(IDocumentParser implementing class)" />
  </parsers>
</documentParserFactory>
The following uses Tesseract to extract English and French text from images in PDF files. It also extracts documents from Zip files and treats them as separate documents.
<documentParserFactory>
  <ocr path="/app/ocr/tesseract.exe">
    <languages>eng, fra</languages>
    <contentTypes>application/pdf</contentTypes>
  </ocr>
  <embedded>
    <splitContentTypes>application/zip</splitContentTypes>
  </embedded>
</documentParserFactory>
Constructor and Description |
---|
GenericDocumentParserFactory() Creates a new document parser factory of the given format. |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
String | getIgnoredContentTypesRegex() Gets the regular expression matching content types to ignore. |
OCRConfig | getOCRConfig() Deprecated. Since 2.6.0, use getParseHints() |
ParseHints | getParseHints() Gets parse hints. |
IDocumentParser | getParser(String documentReference, ContentType contentType) Gets a parser based on content type, regardless of document reference (ignoring it). |
int | hashCode() |
protected void | initDefaultParsers() |
boolean | isSplitEmbedded() Deprecated. Since 2.6.0, use getParseHints() |
void | loadFromXML(Reader in) |
void | registerParser(ContentType contentType, IDocumentParser parser) Registers a parser to use for the given content type. |
void | saveToXML(Writer out) |
void | setIgnoredContentTypesRegex(String ignoredContentTypesRegex) Sets the regular expression matching content types to ignore. |
void | setOCRConfig(OCRConfig ocrConfig) Deprecated. Since 2.6.0, use getParseHints() |
void | setSplitEmbedded(boolean splitEmbedded) Deprecated. Since 2.6.0, use getParseHints() |
String | toString() |
public GenericDocumentParserFactory()
protected void initDefaultParsers()
public ParseHints getParseHints()
public void registerParser(ContentType contentType, IDocumentParser parser)
See also: getIgnoredContentTypesRegex()
Parameters:
contentType - content type
parser - parser
public final IDocumentParser getParser(String documentReference, ContentType contentType)
Specified by:
getParser in interface IDocumentParserFactory
Parameters:
documentReference - document reference
contentType - content type
public String getIgnoredContentTypesRegex()
public void setIgnoredContentTypesRegex(String ignoredContentTypesRegex)
Parameters:
ignoredContentTypesRegex - regular expression
public void loadFromXML(Reader in) throws IOException
Specified by:
loadFromXML in interface IXMLConfigurable
Throws:
IOException
public void saveToXML(Writer out) throws IOException
Specified by:
saveToXML in interface IXMLConfigurable
Throws:
IOException
@Deprecated public OCRConfig getOCRConfig()
Deprecated. Since 2.6.0, use getParseHints()
@Deprecated public void setOCRConfig(OCRConfig ocrConfig)
Deprecated. Since 2.6.0, use getParseHints()
Parameters:
ocrConfig - the ocrConfig to set
@Deprecated public boolean isSplitEmbedded()
Deprecated. Since 2.6.0, use getParseHints()
Returns:
true if parser should split embedded documents.
@Deprecated public void setSplitEmbedded(boolean splitEmbedded)
Deprecated. Since 2.6.0, use getParseHints()
Parameters:
splitEmbedded - true if parser should split embedded documents.
Copyright © 2009–2021 Norconex Inc. All rights reserved.