Class AbstractTikaParser
- java.lang.Object
-
- com.norconex.importer.parser.impl.AbstractTikaParser
-
- All Implemented Interfaces:
IDocumentParser, IHintsAwareParser
- Direct Known Subclasses:
FallbackParser
public class AbstractTikaParser extends Object implements IHintsAwareParser
Base class wrapping an Apache Tika parser for use by the importer.
- Author:
- Pascal Essiembre
-
-
Nested Class Summary
Nested classes (modifier and type, class):
- protected class AbstractTikaParser.MergeEmbeddedParser
- protected static interface AbstractTikaParser.RecursiveParser
- protected class AbstractTikaParser.SplitEmbbededParser
-
Constructor Summary
Constructors (constructor: description):
- AbstractTikaParser(org.apache.tika.parser.Parser parser): Creates a new Tika-based parser.
-
Method Summary
Methods (modifier and type, method: description):
- protected void addTikaMetadataToImporterMetadata(org.apache.tika.metadata.Metadata tikaMeta, Properties metadata)
- protected AbstractTikaParser.RecursiveParser createRecursiveParser(String reference, String contentType, Writer writer, Properties metadata, CachedStreamFactory streamFactory)
- boolean equals(Object other)
- OCRConfig getOCRConfig(): Deprecated.
- int hashCode()
- void initialize(ParseHints parserHints): Initialize this parser with the given parse hints.
- boolean isSplitEmbedded(): Deprecated.
- protected void modifyParseContext(org.apache.tika.parser.ParseContext parseContext): Override to apply your own settings on the Tika ParseContext.
- List<Doc> parseDocument(Doc doc, Writer output): Parses a document.
- void setOCRConfig(OCRConfig ocrConfig): Deprecated.
- void setSplitEmbedded(boolean splitEmbedded): Deprecated.
- String toString()
-
-
-
Method Detail
-
initialize
public void initialize(ParseHints parserHints)
Description copied from interface: IHintsAwareParser
Initialize this parser with the given parse hints. While not mandatory, aware parsers are strongly encouraged to support applicable hints.
- Specified by:
initialize in interface IHintsAwareParser
- Parameters:
parserHints - configuration settings influencing parsing when possible or appropriate
-
parseDocument
public final List<Doc> parseDocument(Doc doc, Writer output) throws DocumentParserException
Description copied from interface: IDocumentParser
Parses a document.
- Specified by:
parseDocument in interface IDocumentParser
- Parameters:
doc - importer document to parse
output - where to store extracted or modified content of the supplied document
- Returns:
- a list of first-level embedded documents, if any
- Throws:
DocumentParserException - problem parsing document
-
modifyParseContext
protected void modifyParseContext(org.apache.tika.parser.ParseContext parseContext)
Override to apply your own settings on the Tika ParseContext. The ParseContext is already configured before calling this method. Changing existing settings may cause failure. Only override if you know what you are doing. The default implementation does nothing.
- Parameters:
parseContext - Tika parse context
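The hook described for modifyParseContext follows a template-method pattern: the base class configures the context first, then hands it to the subclass. The sketch below illustrates only that pattern and does not depend on Tika or the importer. ParseContext and BaseParser here are simplified stand-ins for org.apache.tika.parser.ParseContext and AbstractTikaParser, and PasswordHint, CustomParser and buildContext are hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class ParseContextHookSketch {

    /** Simplified stand-in for Tika's ParseContext: settings keyed by type. */
    static class ParseContext {
        private final Map<String, Object> context = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            if (value != null) {
                context.put(key.getName(), value);
            } else {
                context.remove(key.getName());
            }
        }

        <T> T get(Class<T> key) {
            return key.cast(context.get(key.getName()));
        }
    }

    /** Hypothetical setting type, used only for this illustration. */
    static class PasswordHint {
        final String password;

        PasswordHint(String password) {
            this.password = password;
        }
    }

    /** Stand-in for the base class: configures the context, then calls the hook. */
    abstract static class BaseParser {
        final ParseContext buildContext() {
            ParseContext ctx = new ParseContext();
            // Assumption: stands for whatever the base class sets up by default.
            ctx.set(StringBuilder.class, new StringBuilder("configured by base class"));
            modifyParseContext(ctx);
            return ctx;
        }

        /** Default implementation does nothing. */
        protected void modifyParseContext(ParseContext parseContext) {
            // intentionally empty
        }
    }

    /** A subclass that adds its own setting and leaves existing ones untouched. */
    static class CustomParser extends BaseParser {
        @Override
        protected void modifyParseContext(ParseContext parseContext) {
            parseContext.set(PasswordHint.class, new PasswordHint("secret"));
        }
    }

    public static void main(String[] args) {
        ParseContext ctx = new CustomParser().buildContext();
        System.out.println(ctx.get(PasswordHint.class).password);
        System.out.println(ctx.get(StringBuilder.class));
    }
}
```

Adding a new entry rather than replacing one set by the base class is in line with the warning that changing existing settings may cause failure.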
-
addTikaMetadataToImporterMetadata
protected void addTikaMetadataToImporterMetadata(org.apache.tika.metadata.Metadata tikaMeta, Properties metadata)
-
createRecursiveParser
protected AbstractTikaParser.RecursiveParser createRecursiveParser(String reference, String contentType, Writer writer, Properties metadata, CachedStreamFactory streamFactory)
-
setOCRConfig
@Deprecated public void setOCRConfig(OCRConfig ocrConfig)
Deprecated.
Sets the OCR configuration.
- Parameters:
ocrConfig - the ocrConfig to set
- Since:
- 2.1.0
-
getOCRConfig
@Deprecated public OCRConfig getOCRConfig()
Deprecated.
Gets the OCR configuration (never null).
- Returns:
- the OCR configuration
- Since:
- 2.1.0
-
isSplitEmbedded
@Deprecated public boolean isSplitEmbedded()
Deprecated.
Gets whether embedded documents should be split to become "standalone" distinct documents.
- Returns:
true if parser should split embedded documents.
-
setSplitEmbedded
@Deprecated public void setSplitEmbedded(boolean splitEmbedded)
Deprecated.
Sets whether embedded documents should be split to become "standalone" distinct documents.
- Parameters:
splitEmbedded - true if parser should split embedded documents.
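To make the split setting concrete, here is a minimal sketch of one possible reading of it. It assumes that split embedded documents are returned as standalone documents, and that otherwise their content is merged into the parent's output. It does not use the importer API: SimpleDoc and parse are hypothetical stand-ins, and the real behavior of parseDocument and the nested MergeEmbeddedParser and SplitEmbbededParser classes may differ.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

class EmbeddedHandlingSketch {

    /** Hypothetical stand-in for an importer document with first-level embedded documents. */
    static class SimpleDoc {
        final String reference;
        final String text;
        final List<SimpleDoc> embedded = new ArrayList<>();

        SimpleDoc(String reference, String text) {
            this.reference = reference;
            this.text = text;
        }
    }

    /**
     * Writes the parent's text to the output. When splitEmbedded is true,
     * embedded documents are returned as standalone documents; otherwise
     * their text is appended to the parent's output and the list is empty.
     */
    static List<SimpleDoc> parse(SimpleDoc doc, Writer output, boolean splitEmbedded) {
        List<SimpleDoc> standalone = new ArrayList<>();
        try {
            output.write(doc.text);
            for (SimpleDoc child : doc.embedded) {
                if (splitEmbedded) {
                    standalone.add(child);
                } else {
                    output.write("\n");
                    output.write(child.text);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return standalone;
    }

    public static void main(String[] args) {
        SimpleDoc parent = new SimpleDoc("mail.eml", "message body");
        parent.embedded.add(new SimpleDoc("mail.eml!attachment.txt", "attachment text"));

        StringWriter merged = new StringWriter();
        System.out.println(parse(parent, merged, false).size() + " standalone, output: " + merged);

        StringWriter split = new StringWriter();
        System.out.println(parse(parent, split, true).size() + " standalone, output: " + split);
    }
}
```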
-
-