public class ExternalParser extends Object implements IDocumentParser, IXMLConfigurable
Parses and extracts text from a file by invoking an external application.
This class relies on ExternalHandler for most of the work.
Refer to ExternalHandler for full documentation.
This parser can be configured via XML. See GenericDocumentParserFactory
for general guidance on how to configure parsers.
To use an external application to change a file's content after parsing has
already occurred, consider using ExternalTransformer instead.
<parser
    contentType="(content type this parser is associated to)"
    class="com.norconex.importer.parser.impl.ExternalParser">
  <command>
    c:\Apps\myapp.exe ${INPUT} ${OUTPUT} ${INPUT_META} ${OUTPUT_META}
    ${REFERENCE}
  </command>
  <metadata
      inputFormat="[json|xml|properties]"
      outputFormat="[json|xml|properties]"
      onSet="[append|prepend|replace|optional]">
    <!-- pattern only used when no output format is specified -->
    <pattern
        toField="(toField name)"
        fieldGroup="(toField name match group index)"
        valueGroup="(value match group index)"
        onSet="[append|prepend|replace|optional]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        dotAll="[false|true]"
        unixLines="[false|true]"
        literal="[false|true]"
        comments="[false|true]"
        multiline="[false|true]"
        canonEq="[false|true]"
        unicodeCase="[false|true]"
        unicodeCharacterClass="[false|true]">
      (regular expression)
    </pattern>
    <!-- repeat pattern tag as needed -->
  </metadata>
  <environment>
    <variable
        name="(environment variable name)">
      (environment variable value)
    </variable>
    <!-- repeat variable tag as needed -->
  </environment>
</parser>
<parser
    contentType="text/plain"
    class="com.norconex.importer.parser.impl.ExternalParser">
  <command>/path/transform/app ${INPUT} ${OUTPUT}</command>
  <metadata>
    <pattern
        toField="docnumber"
        valueGroup="1">
      DocNo:(\d+)
    </pattern>
  </metadata>
</parser>
The above example invokes an external application that processes simple text files. The application accepts two files as arguments: the first is the file to transform, and the second holds the transformation result. The example also extracts a document number from STDOUT, found as "DocNo:1234", and stores it as "docnumber".
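To make the pattern part of the example above more concrete, here is a minimal, JDK-only sketch of how a value group can be pulled out of captured STDOUT text. It is not the ExternalHandler implementation: the class `StdoutPatternSketch`, the method `extract`, and the sample output text are hypothetical names and data chosen for illustration. Only the regular expression, the field name, and the group index come from the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Norconex API.
public class StdoutPatternSketch {

    /**
     * Scans the captured output for every match of the regular expression
     * and stores the requested group under the given field name.
     * Group 0 is the whole match; group 1 is the first parenthesized group.
     */
    public static Map<String, List<String>> extract(
            String stdout, String field, String regex, int valueGroup) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(stdout);
        while (m.find()) {
            metadata.computeIfAbsent(field, k -> new ArrayList<>())
                    .add(m.group(valueGroup));
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Assumed sample of what the external application might print.
        String stdout = "Converting file...\nDocNo:1234\nDone.";
        // valueGroup 1 keeps only the digits: {docnumber=[1234]}
        System.out.println(extract(stdout, "docnumber", "DocNo:(\\d+)", 1));
        // valueGroup 0 keeps the whole match: {docnumber=[DocNo:1234]}
        System.out.println(extract(stdout, "docnumber", "DocNo:(\\d+)", 0));
    }
}
```

With `valueGroup="1"`, as in the XML example, only the captured digits end up in the field. Group 0 would store the full matched text instead.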
See Also: ExternalHandler
Constructor and Description |
---|
ExternalParser() |
Modifier and Type | Method and Description |
---|---|
void | addEnvironmentVariable(String name, String value) - Adds an environment variable to the list of previously assigned variables (if any). |
void | addEnvironmentVariables(Map<String,String> environmentVariables) - Adds the environment variables, keeping environment variables previously assigned. |
void | addMetadataExtractionPattern(String field, String pattern) - Adds a metadata extraction pattern that will extract the whole text matched into the given field. |
void | addMetadataExtractionPattern(String field, String pattern, int valueGroup) - Adds a metadata extraction pattern, which will extract the value from the specified group index upon matching. |
void | addMetadataExtractionPatterns(RegexFieldValueExtractor... patterns) - Adds a metadata extraction pattern that will extract matching field names/values. |
boolean | equals(Object other) |
String | getCommand() - Gets the command to execute. |
Map<String,String> | getEnvironmentVariables() - Gets environment variables. |
List<RegexFieldValueExtractor> | getMetadataExtractionPatterns() - Gets metadata extraction patterns. |
String | getMetadataInputFormat() - Gets the format of the metadata input file sent to the external application. |
String | getMetadataOutputFormat() - Gets the format of the metadata output file from the external application. |
PropertySetter | getOnSet() - Gets the property setter to use when a metadata value is set. |
Path | getTempDir() - Gets the directory where temporary files used for transformation are stored. |
int | hashCode() |
void | loadFromXML(XML xml) |
List<Doc> | parseDocument(Doc doc, Writer output) - Parses a document. |
void | saveToXML(XML xml) |
void | setCommand(String command) - Sets the command to execute. |
void | setEnvironmentVariables(Map<String,String> environmentVariables) - Sets the environment variables. |
void | setMetadataExtractionPatterns(RegexFieldValueExtractor... patterns) - Sets metadata extraction patterns. |
void | setMetadataInputFormat(String metadataInputFormat) - Sets the format of the metadata input file sent to the external application. |
void | setMetadataOutputFormat(String metadataOutputFormat) - Sets the format of the metadata output file from the external application. |
void | setOnSet(PropertySetter onSet) - Sets the property setter to use when a metadata value is set. |
void | setTempDir(Path tempDir) - Sets the directory where temporary files used for transformation are stored. |
String | toString() |
public String getCommand()
Gets the command to execute.
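The command shown in the XML usage contains tokens such as ${INPUT} and ${OUTPUT}, which stand for the files passed to the external application. As a rough sketch of the idea only, the JDK-only snippet below substitutes such tokens with file paths before a command would be run. The class `CommandTokenSketch`, the method `expand`, and the temporary file paths are hypothetical. How ExternalHandler actually resolves tokens is described in its own documentation and may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the Norconex API.
public class CommandTokenSketch {

    /**
     * Replaces each ${NAME} token that has a value in the map.
     * Tokens without a value are left untouched.
     */
    public static String expand(String command, Map<String, String> tokens) {
        String result = command;
        for (Map.Entry<String, String> e : tokens.entrySet()) {
            // String.replace performs literal (non-regex) replacement.
            result = result.replace("${" + e.getKey() + "}", e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> tokens = new LinkedHashMap<>();
        tokens.put("INPUT", "/tmp/in.txt");   // assumed temporary input file
        tokens.put("OUTPUT", "/tmp/out.txt"); // assumed temporary output file
        // prints: /path/transform/app /tmp/in.txt /tmp/out.txt
        System.out.println(
                expand("/path/transform/app ${INPUT} ${OUTPUT}", tokens));
    }
}
```

Because the closing brace is part of each token, replacing ${INPUT} does not affect ${INPUT_META} in this sketch.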
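public void setCommand(String command)
Sets the command to execute.
Parameters:
command - the command
public Path getTempDir()
Gets the directory where temporary files used for transformation are stored.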
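public void setTempDir(Path tempDir)
Sets the directory where temporary files used for transformation are stored.
Parameters:
tempDir - temporary directory
public List<RegexFieldValueExtractor> getMetadataExtractionPatterns()
Gets metadata extraction patterns.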
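public void addMetadataExtractionPattern(String field, String pattern)
Adds a metadata extraction pattern that will extract the whole text matched into the given field.
Parameters:
field - target field to store the matching pattern.
pattern - the pattern
public void addMetadataExtractionPattern(String field, String pattern, int valueGroup)
Adds a metadata extraction pattern, which will extract the value from the specified group index upon matching.
Parameters:
field - target field to store the matching pattern.
pattern - the pattern
valueGroup - which pattern group to return.
public void addMetadataExtractionPatterns(RegexFieldValueExtractor... patterns)
Adds a metadata extraction pattern that will extract matching field names/values.
Parameters:
patterns - extraction pattern
public void setMetadataExtractionPatterns(RegexFieldValueExtractor... patterns)
Sets metadata extraction patterns.
Parameters:
patterns - extraction pattern
public PropertySetter getOnSet()
Gets the property setter to use when a metadata value is set.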
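The XML usage lists four onSet options: append, prepend, replace, and optional. This page does not define them, so the sketch below only illustrates one plausible reading of those names on a multi-valued metadata map. It assumes that "optional" means a value is set only when the field has no value yet. The class `OnSetSketch`, the enum `Strategy`, and the method `apply` are hypothetical and are not the PropertySetter API; consult the PropertySetter documentation for the authoritative behavior.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the PropertySetter implementation.
public class OnSetSketch {

    /** Stand-in for the four onSet options listed in the XML usage. */
    public enum Strategy { APPEND, PREPEND, REPLACE, OPTIONAL }

    /** Applies one value to a field according to the assumed strategy. */
    public static void apply(Strategy strategy,
            Map<String, List<String>> metadata, String field, String value) {
        List<String> values =
                metadata.computeIfAbsent(field, k -> new ArrayList<>());
        switch (strategy) {
            case APPEND:   // add after any existing values
                values.add(value);
                break;
            case PREPEND:  // add before any existing values
                values.add(0, value);
                break;
            case REPLACE:  // discard existing values, keep only the new one
                values.clear();
                values.add(value);
                break;
            case OPTIONAL: // assumed: set only when the field is still empty
                if (values.isEmpty()) {
                    values.add(value);
                }
                break;
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        apply(Strategy.APPEND, metadata, "docnumber", "1234");
        apply(Strategy.PREPEND, metadata, "docnumber", "0001");
        apply(Strategy.OPTIONAL, metadata, "docnumber", "9999"); // ignored
        System.out.println(metadata); // {docnumber=[0001, 1234]}
    }
}
```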
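public void setOnSet(PropertySetter onSet)
Sets the property setter to use when a metadata value is set.
Parameters:
onSet - property setter
public Map<String,String> getEnvironmentVariables()
Gets environment variables.
Returns:
the environment variables, or null if using the current process environment variables
public void setEnvironmentVariables(Map<String,String> environmentVariables)
Sets the environment variables. Set to null to use the current process environment variables (default).
Parameters:
environmentVariables - environment variables
public void addEnvironmentVariables(Map<String,String> environmentVariables)
Adds the environment variables, keeping environment variables previously assigned. To use the current process environment variables instead, pass null to setEnvironmentVariables(Map).
Parameters:
environmentVariables - environment variables
public void addEnvironmentVariable(String name, String value)
Adds an environment variable to the list of previously assigned variables (if any). A null name has no effect while null values are converted to empty strings.
Parameters:
name - environment variable name
value - environment variable value
public String getMetadataInputFormat()
Gets the format of the metadata input file sent to the external application. Only applicable when the ${INPUT} token is part of the command.
public void setMetadataInputFormat(String metadataInputFormat)
Sets the format of the metadata input file sent to the external application. Only applicable when the ${INPUT} token is part of the command.
Parameters:
metadataInputFormat - format of the metadata input file
public String getMetadataOutputFormat()
Gets the format of the metadata output file from the external application. Only applicable when the ${OUTPUT} token is part of the command.
public void setMetadataOutputFormat(String metadataOutputFormat)
Sets the format of the metadata output file from the external application. Set to null to rely on metadata extraction patterns instead. Only applicable when the ${OUTPUT} token is part of the command.
Parameters:
metadataOutputFormat - format of the metadata output file
public List<Doc> parseDocument(Doc doc, Writer output) throws DocumentParserException
Description copied from interface: IDocumentParser
Parses a document.
Specified by:
parseDocument in interface IDocumentParser
Parameters:
doc - importer document to parse
output - where to store extracted or modified content of the supplied document
Throws:
DocumentParserException - problem parsing document
public void loadFromXML(XML xml)
Specified by:
loadFromXML in interface IXMLConfigurable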
public void saveToXML(XML xml)
Specified by:
saveToXML in interface IXMLConfigurable
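The null-handling rules documented above for the environment variable methods (a null map means the current process environment is used, a null name has no effect, and null values become empty strings) can be sketched with plain JDK classes as follows. This is an illustration under stated assumptions, not the ExternalHandler code: the class `EnvironmentSketch` and its `applyTo` method are hypothetical, and replacing the whole child-process environment when variables are set is an assumption made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the Norconex API.
public class EnvironmentSketch {

    // null means "use the current process environment variables" (default).
    private Map<String, String> variables;

    /** Replaces all variables; null reverts to the current process environment. */
    public void setEnvironmentVariables(Map<String, String> vars) {
        variables = (vars == null) ? null : new HashMap<>(vars);
    }

    /** Adds one variable; a null name is ignored, a null value becomes "". */
    public void addEnvironmentVariable(String name, String value) {
        if (name == null) {
            return;
        }
        if (variables == null) {
            variables = new HashMap<>();
        }
        variables.put(name, value == null ? "" : value);
    }

    public Map<String, String> getEnvironmentVariables() {
        return variables;
    }

    /**
     * Assumption for this sketch: explicit variables replace the inherited
     * environment of the child process; null leaves it untouched.
     */
    public void applyTo(ProcessBuilder builder) {
        if (variables != null) {
            Map<String, String> env = builder.environment();
            env.clear();
            env.putAll(variables);
        }
    }

    public static void main(String[] args) {
        EnvironmentSketch env = new EnvironmentSketch();
        env.addEnvironmentVariable(null, "ignored"); // no effect
        env.addEnvironmentVariable("APP_MODE", null); // stored as ""
        System.out.println(env.getEnvironmentVariables()); // {APP_MODE=}
    }
}
```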
Copyright © 2009–2023 Norconex Inc. All rights reserved.