public class ExternalParser extends Object implements IDocumentParser, IXMLConfigurable
Parses and extracts text from a file using an external application.
This parser relies heavily on the mechanics of ExternalTransformer. Refer to that class for documentation.
This parser can be made configurable via XML. See GenericDocumentParserFactory for general indications on how to configure parsers.
To use an external application to change a file's content after parsing has already occurred, consider using ExternalTransformer instead.
<parser contentType="(content type this parser is associated to)"
    class="com.norconex.importer.parser.impl.ExternalParser" >
  <command>
    c:\Apps\myapp.exe ${INPUT} ${OUTPUT} ${INPUT_META} ${OUTPUT_META} ${REFERENCE}
  </command>
  <metadata
      inputFormat="[json|xml|properties]"
      outputFormat="[json|xml|properties]">
    <!-- pattern only used when no output format is specified -->
    <pattern field="(target field name)"
        fieldGroup="(field name match group index)"
        valueGroup="(field value match group index)"
        caseSensitive="[false|true]">
      (regular expression)
    </pattern>
    <!-- repeat pattern tag as needed -->
  </metadata>
  <environment>
    <variable name="(environment variable name)">
      (environment variable value)
    </variable>
    <!-- repeat variable tag as needed -->
  </environment>
</parser>
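The ${INPUT}-style tokens in the command are placeholders that get replaced with actual values before the command runs. The following is only a minimal sketch of such a substitution, using the JDK alone. The class name, method name, and file paths are hypothetical and are not part of the Norconex API; the actual expansion logic lives in the library and may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of the library): expands ${TOKEN} placeholders. */
class CommandTokenSketch {

    /** Replaces each ${NAME} placeholder found in the template with its value. */
    static String expand(String template, Map<String, String> values) {
        String command = template;
        for (Map.Entry<String, String> e : values.entrySet()) {
            // String.replace performs a literal (non-regex) replacement, and the
            // closing brace keeps ${INPUT} from matching inside ${INPUT_META}.
            command = command.replace("${" + e.getKey() + "}", e.getValue());
        }
        return command;
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("INPUT", "/tmp/in.txt");   // made-up paths for illustration
        values.put("OUTPUT", "/tmp/out.txt");
        System.out.println(
                expand("/path/transform/app ${INPUT} ${OUTPUT}", values));
    }
}
```

Tokens with no supplied value are left untouched in this sketch; how the library treats them is not stated on this page.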
The following example invokes an external application that processes simple text files and accepts two files as arguments: the first is the file to transform, the second holds the transformation result. It also extracts a document number from STDOUT, found as "DocNo:1234", and stores it as "docnumber".
<parser contentType="text/plain"
    class="com.norconex.importer.parser.impl.ExternalParser" >
  <command>/path/transform/app ${INPUT} ${OUTPUT}</command>
  <metadata>
    <pattern field="docnumber" valueGroup="1">DocNo:(\d+)</pattern>
  </metadata>
</parser>
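To illustrate what the pattern in this example matches, here is a minimal sketch using only java.util.regex. It assumes the regular expression is searched for anywhere in the STDOUT text and that valueGroup selects the capture group to keep. The class and method names are hypothetical, and the sample STDOUT content is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration (not the library's code) of the "docnumber" pattern. */
class DocNumberSketch {

    /** Returns the text of the given group for the first match, or null if none. */
    static String extract(String stdout, String regex, int valueGroup) {
        Matcher m = Pattern.compile(regex).matcher(stdout);
        return m.find() ? m.group(valueGroup) : null;
    }

    public static void main(String[] args) {
        String stdout = "Processing done.\nDocNo:1234\n"; // made-up sample output
        // Group 1 of DocNo:(\d+) holds the digits only.
        System.out.println("docnumber=" + extract(stdout, "DocNo:(\\d+)", 1));
        // prints "docnumber=1234"
    }
}
```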
See also: ExternalTransformer
Modifier and Type | Field and Description |
---|---|
static String | META_FORMAT_JSON |
static String | META_FORMAT_PROPERTIES |
static String | META_FORMAT_XML |
static String | TOKEN_INPUT |
static String | TOKEN_INPUT_META |
static String | TOKEN_OUTPUT |
static String | TOKEN_OUTPUT_META |
static String | TOKEN_REFERENCE |
Constructor and Description |
---|
ExternalParser() |
Modifier and Type | Method and Description |
---|---|
void | addEnvironmentVariable(String name, String value) Adds an environment variable to the list of previously assigned variables (if any). |
void | addEnvironmentVariables(Map<String,String> environmentVariables) Adds the environment variables, keeping environment variables previously assigned. |
void | addMetadataExtractionPattern(Pattern pattern, boolean reverse) Deprecated. Since 2.8.0, use addMetadataExtractionPatterns(RegexFieldExtractor...) |
void | addMetadataExtractionPattern(Pattern pattern, String field) Deprecated. Since 2.8.0, use addMetadataExtractionPatterns(RegexFieldExtractor...) |
void | addMetadataExtractionPattern(String field, String pattern) Adds a metadata extraction pattern that will extract the whole text matched into the given field. |
void | addMetadataExtractionPattern(String field, String pattern, int valueGroup) Adds a metadata extraction pattern, which will extract the value from the specified group index upon matching. |
void | addMetadataExtractionPatterns(Map<Pattern,String> patterns) Deprecated. Since 2.8.0, use addMetadataExtractionPatterns(RegexFieldExtractor...) |
void | addMetadataExtractionPatterns(RegexFieldExtractor... patterns) Adds metadata extraction patterns that will extract matching field names/values. |
boolean | equals(Object other) |
String | getCommand() Gets the command to execute. |
Map<String,String> | getEnvironmentVariables() Gets environment variables. |
List<RegexFieldExtractor> | getMetadataExtractionPatterns() Gets metadata extraction patterns. |
String | getMetadataInputFormat() Gets the format of the metadata input file sent to the external application. |
String | getMetadataOutputFormat() Gets the format of the metadata output file from the external application. |
File | getTempDir() Gets the directory where temporary files used for transformation are stored. |
int | hashCode() |
void | loadFromXML(Reader in) |
List<ImporterDocument> | parseDocument(ImporterDocument doc, Writer output) Parses a document. |
void | saveToXML(Writer out) |
void | setCommand(String command) Sets the command to execute. |
void | setEnvironmentVariables(Map<String,String> environmentVariables) Sets the environment variables. |
void | setMetadataExtractionPatterns(Map<Pattern,String> patterns) Deprecated. Since 2.8.0, use addMetadataExtractionPatterns(RegexFieldExtractor...) |
void | setMetadataExtractionPatterns(RegexFieldExtractor... patterns) Sets metadata extraction patterns. |
void | setMetadataInputFormat(String metadataInputFormat) Sets the format of the metadata input file sent to the external application. |
void | setMetadataOutputFormat(String metadataOutputFormat) Sets the format of the metadata output file from the external application. |
void | setTempDir(File tempDir) Sets the directory where temporary files used for transformation are stored. |
String | toString() |
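The pattern-based methods above extract field names and values by match group. As a rough sketch of that idea, assuming a field-name group index and a value group index like the fieldGroup and valueGroup attributes of the XML configuration, the JDK-only example below collects every match into a map. The class and method names are hypothetical and this is not RegexFieldExtractor itself. Keeping only the last value for a repeated field name is a simplification of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of field/value extraction by match group index. */
class FieldValueSketch {

    /** Collects every match, using one group as field name and another as value. */
    static Map<String, String> extractAll(
            String text, String regex, int fieldGroup, int valueGroup) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            // A repeated field name overwrites the earlier value in this sketch.
            fields.put(m.group(fieldGroup), m.group(valueGroup));
        }
        return fields;
    }

    public static void main(String[] args) {
        String stdout = "title=Report\nauthor=Jane\n"; // made-up sample output
        // Group 1 is the field name, group 2 the value ('.' stops at line ends).
        System.out.println(extractAll(stdout, "(\\w+)=(.*)", 1, 2));
    }
}
```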
public static final String TOKEN_INPUT
public static final String TOKEN_OUTPUT
public static final String TOKEN_INPUT_META
public static final String TOKEN_OUTPUT_META
public static final String TOKEN_REFERENCE
public static final String META_FORMAT_JSON
public static final String META_FORMAT_XML
public static final String META_FORMAT_PROPERTIES
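The META_FORMAT_* constants correspond to the metadata file formats accepted in the XML configuration (json, xml, properties). As an illustration of the simplest of the three, the sketch below reads key/value pairs in Java properties syntax with java.util.Properties. The exact layout the library reads or writes is an assumption here, and the class name and sample keys are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical illustration: reading metadata in Java properties syntax. */
class MetaPropertiesSketch {

    /** Parses key=value lines into a Properties object. */
    static Properties read(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Made-up metadata such as an external application might write.
        Properties meta = read("docnumber=1234\ntitle=Quarterly Report\n");
        System.out.println(meta.getProperty("docnumber"));
        System.out.println(meta.getProperty("title"));
    }
}
```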
public String getCommand()

public void setCommand(String command)
Parameters:
command - the command

public File getTempDir()
public void setTempDir(File tempDir)
Parameters:
tempDir - temporary directory

public List<RegexFieldExtractor> getMetadataExtractionPatterns()
@Deprecated public void setMetadataExtractionPatterns(Map<Pattern,String> patterns)
Deprecated. Since 2.8.0, use addMetadataExtractionPatterns(RegexFieldExtractor...)
Parameters:
patterns - map of patterns and field names

@Deprecated public void addMetadataExtractionPatterns(Map<Pattern,String> patterns)
Deprecated. Since 2.8.0, use addMetadataExtractionPatterns(RegexFieldExtractor...)
Parameters:
patterns - map of patterns and field names

@Deprecated public void addMetadataExtractionPattern(Pattern pattern, boolean reverse)
Deprecated. Since 2.8.0, use addMetadataExtractionPatterns(RegexFieldExtractor...)
Parameters:
pattern - pattern with two match groups
reverse - whether to reverse match groups (inverse key and value).

@Deprecated public void addMetadataExtractionPattern(Pattern pattern, String field)
Deprecated. Since 2.8.0, use addMetadataExtractionPatterns(RegexFieldExtractor...)
Parameters:
pattern - pattern with no or one match group
field - field name where to store the matched pattern

public void addMetadataExtractionPattern(String field, String pattern)
Parameters:
field - target field to store the matching pattern.
pattern - the pattern

public void addMetadataExtractionPattern(String field, String pattern, int valueGroup)
Parameters:
field - target field to store the matching pattern.
pattern - the pattern
valueGroup - which pattern group to return.

public void addMetadataExtractionPatterns(RegexFieldExtractor... patterns)
Parameters:
patterns - extraction pattern

public void setMetadataExtractionPatterns(RegexFieldExtractor... patterns)
Parameters:
patterns - extraction pattern

public Map<String,String> getEnvironmentVariables()
Returns null if using the current process environment variables.

public void setEnvironmentVariables(Map<String,String> environmentVariables)
Set to null to use the current process environment variables (default).
Parameters:
environmentVariables - environment variables

public void addEnvironmentVariables(Map<String,String> environmentVariables)
To go back to the current process environment variables, pass null to setEnvironmentVariables(Map).
Parameters:
environmentVariables - environment variables

public void addEnvironmentVariable(String name, String value)
A null name has no effect, while null values are converted to empty strings.
Parameters:
name - environment variable name
value - environment variable value

public String getMetadataInputFormat()
Only applicable when the ${INPUT} token is part of the command.

public void setMetadataInputFormat(String metadataInputFormat)
Only applicable when the ${INPUT} token is part of the command.
Parameters:
metadataInputFormat - format of the metadata input file

public String getMetadataOutputFormat()
Only applicable when the ${OUTPUT} token is part of the command.

public void setMetadataOutputFormat(String metadataOutputFormat)
Set to null to rely on metadata extraction patterns instead. Only applicable when the ${OUTPUT} token is part of the command.
Parameters:
metadataOutputFormat - format of the metadata output file

public List<ImporterDocument> parseDocument(ImporterDocument doc, Writer output) throws DocumentParserException
Specified by:
parseDocument in interface IDocumentParser
Parameters:
doc - importer document to parse
output - where to store extracted or modified content of the supplied document
Throws:
DocumentParserException - problem parsing document

public void loadFromXML(Reader in) throws IOException
Specified by:
loadFromXML in interface IXMLConfigurable
Throws:
IOException

public void saveToXML(Writer out) throws IOException
Specified by:
saveToXML in interface IXMLConfigurable
Throws:
IOException
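
The null-handling rules given for addEnvironmentVariable (a null name has no effect, null values become empty strings) and for the environment variable map (null means the current process environment is used) can be pictured with the small sketch below. It is an assumption-laden illustration, not the library's implementation: the class, field, and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of the documented null-handling rules. */
class EnvVarsSketch {

    // null stands for "use the current process environment variables"
    private Map<String, String> vars;

    /** Adds one variable, keeping any previously assigned ones. */
    void add(String name, String value) {
        if (name == null) {
            return; // a null name has no effect
        }
        if (vars == null) {
            vars = new HashMap<>();
        }
        // null values are converted to empty strings
        vars.put(name, value == null ? "" : value);
    }

    /** Returns the assigned variables, or null if none were assigned. */
    Map<String, String> get() {
        return vars;
    }

    public static void main(String[] args) {
        EnvVarsSketch env = new EnvVarsSketch();
        env.add(null, "ignored");
        env.add("APP_HOME", null);   // made-up variable names
        env.add("APP_MODE", "batch");
        System.out.println(env.get().size());                   // prints 2
        System.out.println(env.get().get("APP_HOME").isEmpty()); // prints true
    }
}
```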
Copyright © 2009–2021 Norconex Inc. All rights reserved.