Class AbstractTikaParser
- java.lang.Object
-
- com.norconex.importer.parser.impl.AbstractTikaParser
-
- All Implemented Interfaces:
IDocumentParser, IHintsAwareParser
- Direct Known Subclasses:
FallbackParser
public class AbstractTikaParser extends Object implements IHintsAwareParser
Base class wrapping an Apache Tika parser for use by the importer.
- Author:
Pascal Essiembre
-
-
Nested Class Summary
Nested Classes
Modifier and Type, Class:
protected class AbstractTikaParser.MergeEmbeddedParser
protected static interface AbstractTikaParser.RecursiveParser
protected class AbstractTikaParser.SplitEmbbededParser
-
Constructor Summary
Constructors
Constructor and Description:
AbstractTikaParser(org.apache.tika.parser.Parser parser) - Creates a new Tika-based parser.
-
Method Summary
Modifier and Type, Method, and Description:
protected void addTikaMetadataToImporterMetadata(org.apache.tika.metadata.Metadata tikaMeta, Properties metadata)
protected AbstractTikaParser.RecursiveParser createRecursiveParser(String reference, String contentType, Writer writer, Properties metadata, CachedStreamFactory streamFactory)
boolean equals(Object other)
OCRConfig getOCRConfig() - Deprecated.
int hashCode()
void initialize(ParseHints parserHints) - Initialize this parser with the given parse hints.
boolean isSplitEmbedded() - Deprecated.
protected void modifyParseContext(org.apache.tika.parser.ParseContext parseContext) - Override to apply your own settings on the Tika ParseContext.
List<Doc> parseDocument(Doc doc, Writer output) - Parses a document.
void setOCRConfig(OCRConfig ocrConfig) - Deprecated.
void setSplitEmbedded(boolean splitEmbedded) - Deprecated.
String toString()
-
-
-
Method Detail
-
initialize
public void initialize(ParseHints parserHints)
Description copied from interface: IHintsAwareParser
Initialize this parser with the given parse hints. While not mandatory, hints-aware parsers are strongly encouraged to support applicable hints.
- Specified by:
initialize in interface IHintsAwareParser
- Parameters:
parserHints - configuration settings influencing parsing when possible or appropriate
-
parseDocument
public final List<Doc> parseDocument(Doc doc, Writer output) throws DocumentParserException
Description copied from interface: IDocumentParser
Parses a document.
- Specified by:
parseDocument in interface IDocumentParser
- Parameters:
doc - importer document to parse
output - where to store extracted or modified content of the supplied document
- Returns:
a list of first-level embedded documents, if any
- Throws:
DocumentParserException - problem parsing document
-
modifyParseContext
protected void modifyParseContext(org.apache.tika.parser.ParseContext parseContext)
Override to apply your own settings on the Tika ParseContext. The ParseContext is already configured before this method is called, and changing existing settings may cause failures. Only override if you know what you are doing. The default implementation does nothing.
- Parameters:
parseContext - Tika parse context
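The hook behaviour described above (the base class configures the context first, then calls an overridable method that does nothing by default) can be illustrated with a minimal, self-contained sketch. Every type below (SketchParseContext, SketchBaseParser, ImageExtractionSetting, CustomParser) is a hypothetical stand-in invented for illustration. None of them is the actual Norconex or Tika class, and the real base class may apply its configuration differently.

```java
import java.util.HashMap;
import java.util.Map;

public class ParseContextHookSketch {

    /** Hypothetical stand-in for a class-keyed parse context. */
    static class SketchParseContext {
        private final Map<Class<?>, Object> settings = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            settings.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Returns null when nothing was registered for the key.
            return key.cast(settings.get(key));
        }
    }

    /** Hypothetical setting type, used only in this illustration. */
    static class ImageExtractionSetting {
        final boolean extractInlineImages;

        ImageExtractionSetting(boolean extractInlineImages) {
            this.extractInlineImages = extractInlineImages;
        }
    }

    /** Hypothetical base parser: configures the context, then calls the hook. */
    abstract static class SketchBaseParser {
        final SketchParseContext buildContext() {
            SketchParseContext ctx = new SketchParseContext();
            // The base configuration is applied before the hook runs.
            ctx.set(String.class, "base-configured");
            // Subclasses may then add their own settings.
            modifyParseContext(ctx);
            return ctx;
        }

        /** Default implementation does nothing. */
        protected void modifyParseContext(SketchParseContext ctx) {
            // intentionally empty
        }
    }

    /** Subclass that adds a setting without touching the existing ones. */
    static class CustomParser extends SketchBaseParser {
        @Override
        protected void modifyParseContext(SketchParseContext ctx) {
            ctx.set(ImageExtractionSetting.class, new ImageExtractionSetting(true));
        }
    }

    public static void main(String[] args) {
        SketchParseContext ctx = new CustomParser().buildContext();
        System.out.println(ctx.get(String.class));
        System.out.println(ctx.get(ImageExtractionSetting.class).extractInlineImages);
    }
}
```

In this sketch the override only adds a new entry and leaves the base configuration untouched, which is the cautious usage the description recommends.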
-
addTikaMetadataToImporterMetadata
protected void addTikaMetadataToImporterMetadata(org.apache.tika.metadata.Metadata tikaMeta, Properties metadata)
-
createRecursiveParser
protected AbstractTikaParser.RecursiveParser createRecursiveParser(String reference, String contentType, Writer writer, Properties metadata, CachedStreamFactory streamFactory)
-
setOCRConfig
@Deprecated public void setOCRConfig(OCRConfig ocrConfig)
Deprecated.
Sets the OCR configuration.
- Parameters:
ocrConfig - the OCR configuration to set
- Since:
2.1.0
-
getOCRConfig
@Deprecated public OCRConfig getOCRConfig()
Deprecated.
Gets the OCR configuration (never null).
- Returns:
the OCR configuration
- Since:
2.1.0
-
isSplitEmbedded
@Deprecated public boolean isSplitEmbedded()
Deprecated.
Gets whether embedded documents should be split to become "standalone" distinct documents.
- Returns:
true if the parser should split embedded documents.
-
setSplitEmbedded
@Deprecated public void setSplitEmbedded(boolean splitEmbedded)
Deprecated.
Sets whether embedded documents should be split to become "standalone" distinct documents.
- Parameters:
splitEmbedded - true if the parser should split embedded documents.
-
-