public class ExternalHandler extends Object
Class executing an external application to extract data from and/or manipulate a document.
When constructing the command to launch the external application, this class looks for specific tokens and replaces them with file path arguments (in addition to any other arguments you supply). The path arguments are created by this class. Tokens are case-sensitive, and the files they represent are temporary (they are deleted once they have been dealt with). It is possible to omit one or more tokens to use standard streams instead, where applicable.
Tokens supported by this class are:
${INPUT}
${INPUT_META}
${OUTPUT}
${OUTPUT_META}
${REFERENCE}
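The token substitution described above can be pictured with the following minimal sketch. It is not this class's actual implementation: the class name `CommandTokenSketch`, the `expand` method, and the file paths are hypothetical and only illustrate literal, case-sensitive replacement of tokens that are present in the command.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: expands ${TOKEN} placeholders in a command string. */
class CommandTokenSketch {

    /**
     * Replaces each token found in the command with the path mapped to it.
     * Matching is literal and case-sensitive; tokens absent from the
     * command are simply not replaced.
     */
    static String expand(String command, Map<String, String> tokenPaths) {
        String expanded = command;
        for (Map.Entry<String, String> entry : tokenPaths.entrySet()) {
            // String.replace performs a literal (non-regex) replacement.
            expanded = expanded.replace(entry.getKey(), entry.getValue());
        }
        return expanded;
    }

    public static void main(String[] args) {
        Map<String, String> paths = new LinkedHashMap<>();
        // Hypothetical temporary file paths.
        paths.put("${INPUT}", "/tmp/in.tmp");
        paths.put("${OUTPUT}", "/tmp/out.tmp");
        System.out.println(expand("/Apps/myapp.exe ${INPUT} ${OUTPUT}", paths));
    }
}
```

Because each token includes its closing brace, `${INPUT}` does not accidentally match inside `${INPUT_META}`.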
If ${INPUT_META} is part of the command, metadata can be provided to the external application in JSON (default), XML, or Properties format. The same formats can be used if ${OUTPUT_META} is part of the command. The formats are:
{
"field1" : [ "value1a", "value1b", "value1c" ],
"field2" : [ "value2" ],
"field3" : [ "value3a", "value3b" ]
}
Java Properties XML file format, with the exception that metadata fields with multiple values are supported: their values are joined by the symbol for record separator (U+241E). Example:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>My Comment</comment>
<entry key="field1">value1a␞value1b␞value1c</entry>
<entry key="field2">value2</entry>
<entry key="field3">value3a␞value3b</entry>
</properties>
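As a rough illustration of how an external application written in Java might read such a file, the sketch below loads the XML with the JDK's `java.util.Properties.loadFromXML` and splits each value on U+241E. The class and method names are hypothetical, and this is not necessarily how this class parses the format internally.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Illustrative only: reads multi-valued metadata from Properties XML. */
class XmlMetadataSketch {

    /** Symbol for record separator, used to join multiple values. */
    static final String SEPARATOR = "\u241E";

    static Map<String, List<String>> parse(String xml) {
        Properties props = new Properties();
        try {
            // The JDK loader expects the properties DOCTYPE shown above.
            props.loadFromXML(new ByteArrayInputStream(
                    xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, List<String>> fields = new TreeMap<>();
        for (String name : props.stringPropertyNames()) {
            // Split the joined value back into its individual values.
            fields.put(name,
                    Arrays.asList(props.getProperty(name).split(SEPARATOR)));
        }
        return fields;
    }
}
```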
Java Properties standard file format, with the exception that metadata fields with multiple values are supported: their values are joined by the symbol for record separator (U+241E). Refer to Java Properties.loadFromProperties(java.io.Reader) for general syntax information. Example:
# My Comment
field1 = value1a␞value1b␞value1c
field2 = value2
field3 = value3a␞value3b
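The multi-value convention can be sketched with the JDK's own `java.util.Properties` reader, as below. This is an assumption-laden illustration (hypothetical class and method names), not the parsing code used by this class: it loads the text and splits each value on U+241E.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Illustrative only: reads multi-valued metadata from Properties text. */
class MultiValuePropertiesSketch {

    /** Symbol for record separator, used to join multiple values. */
    static final String SEPARATOR = "\u241E";

    static Map<String, List<String>> parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, List<String>> fields = new TreeMap<>();
        for (String name : props.stringPropertyNames()) {
            // A single-valued field yields a one-element list.
            fields.put(name,
                    Arrays.asList(props.getProperty(name).split(SEPARATOR)));
        }
        return fields;
    }
}
```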
It is possible to specify metadata extraction patterns that will be applied either to the returned metadata file or to the standard output and error streams. If ${OUTPUT_META} is found in the command, the output format is used to parse the returned metadata file. Set the format to null to rely on extraction patterns for parsing the output file instead. When ${OUTPUT_META} is omitted, extraction patterns are applied to the external application's standard output and standard error streams. If ${OUTPUT_META} is absent and no metadata extraction patterns are defined, it is assumed the external application did not produce any new metadata.
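The rules above amount to a small decision table, sketched below. The enum, class, and method names are hypothetical. The case where ${OUTPUT_META} is present with a null format and no patterns is not covered by the description, so treating it as "no new metadata" is an assumption of this sketch.

```java
/** Illustrative only: picks how output metadata would be obtained. */
class OutputMetaStrategySketch {

    enum Strategy {
        PARSE_FILE_WITH_FORMAT,  // ${OUTPUT_META} present, format set
        PATTERNS_ON_FILE,        // ${OUTPUT_META} present, format null
        PATTERNS_ON_STREAMS,     // ${OUTPUT_META} omitted, patterns defined
        NO_NEW_METADATA          // nothing to parse
    }

    static Strategy choose(
            String command, String outputFormat, int patternCount) {
        boolean hasOutputMeta = command.contains("${OUTPUT_META}");
        if (hasOutputMeta && outputFormat != null) {
            return Strategy.PARSE_FILE_WITH_FORMAT;
        }
        if (patternCount > 0) {
            return hasOutputMeta
                    ? Strategy.PATTERNS_ON_FILE
                    : Strategy.PATTERNS_ON_STREAMS;
        }
        // Assumption: without a format or patterns, nothing is extracted.
        return Strategy.NO_NEW_METADATA;
    }
}
```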
When using metadata extraction patterns with standard streams, each pattern is applied to each line returned from STDOUT and STDERR. A metadata field name can be supplied with each pattern. If the pattern does not contain any match group, the entire matched expression is used as the metadata field value.
Field names and values can be obtained with the same regular expression. This is done by using match groups (parentheses) in your regular expressions. For each pattern you define, you can specify which match group holds the field name and which one holds the value.
Specifying a field match group is optional if a field is provided. If no match groups are specified, a field is expected.
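The match-group behavior can be illustrated with plain `java.util.regex`, as in the sketch below. It does not use RegexFieldValueExtractor; the class name, the `extract` method, and the convention of passing -1 for an unspecified group are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: extracts a field name/value pair from one line. */
class PatternExtractionSketch {

    /**
     * @param toField    fixed field name, used when fieldGroup is -1
     * @param fieldGroup match group holding the field name, or -1
     * @param valueGroup match group holding the value, or -1 to use
     *                   the entire matched expression
     */
    static void extract(String line, Pattern pattern, String toField,
            int fieldGroup, int valueGroup,
            Map<String, List<String>> metadata) {
        Matcher m = pattern.matcher(line);
        while (m.find()) {
            String field = fieldGroup >= 0 ? m.group(fieldGroup) : toField;
            String value = valueGroup >= 0 ? m.group(valueGroup) : m.group();
            if (field != null && value != null) {
                // Values are appended to any existing values for the field.
                metadata.computeIfAbsent(field, k -> new ArrayList<>())
                        .add(value);
            }
        }
    }
}
```

For instance, applying the pattern `^(\w+):\s*(.*)$` with field group 1 and value group 2 to the line `title: Hello World` would store `Hello World` under the field `title`.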
If a target field with the same name already exists for a document, values will be added to the end of the existing value list. It is possible to change this default behavior by supplying a PropertySetter.
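To make the difference between setter behaviors concrete, here is a minimal sketch using a plain map of value lists. It does not use the PropertySetter class itself. The option names mirror the onSet values listed in the XML configuration on this page (append, prepend, replace, optional); the exact semantics shown, in particular that "optional" only sets values when the field has none yet, are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative only: possible semantics for the onSet options. */
class OnSetSketch {

    enum OnSet { APPEND, PREPEND, REPLACE, OPTIONAL }

    static void apply(OnSet onSet, Map<String, List<String>> metadata,
            String field, List<String> newValues) {
        List<String> existing =
                metadata.computeIfAbsent(field, k -> new ArrayList<>());
        switch (onSet) {
            case APPEND:   // default: add after the existing values
                existing.addAll(newValues);
                break;
            case PREPEND:  // add before the existing values
                existing.addAll(0, newValues);
                break;
            case REPLACE:  // discard the existing values
                existing.clear();
                existing.addAll(newValues);
                break;
            case OPTIONAL: // assumption: set only if no value exists yet
                if (existing.isEmpty()) {
                    existing.addAll(newValues);
                }
                break;
        }
    }
}
```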
Execution environment variables can be set to replace environment variables defined for the current process.
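What "replacing" the process environment means can be sketched with the JDK's `ProcessBuilder`. This is only an illustration under the assumption that a null map means "inherit the current process environment"; the class and method names are hypothetical, and the launching code of this class may differ.

```java
import java.util.Map;

/** Illustrative only: replaces the environment of a process to launch. */
class EnvironmentSketch {

    /**
     * When variables is null, the builder keeps the environment inherited
     * from the current process. Otherwise the inherited variables are
     * cleared and only the supplied ones are passed on.
     */
    static ProcessBuilder configure(
            ProcessBuilder builder, Map<String, String> variables) {
        if (variables != null) {
            Map<String, String> env = builder.environment();
            env.clear();
            env.putAll(variables);
        }
        return builder;
    }
}
```

Note that clearing the inherited environment also removes variables such as PATH, so an external application that relies on them would need them supplied explicitly.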
To extract raw text from files, it is recommended to use an ExternalParser instead.
<command>
  /Apps/myapp.exe ${INPUT} ${OUTPUT} ${INPUT_META} ${OUTPUT_META} ${REFERENCE}
</command>
<metadata
    inputFormat="[json|xml|properties]"
    outputFormat="[json|xml|properties]"
    onSet="[append|prepend|replace|optional]">
  <!--
    Pattern only used when no output format is specified.
    Repeat as needed.
    -->
  <pattern
      toField="(toField name)"
      fieldGroup="(toField name match group index)"
      valueGroup="(value match group index)"
      onSet="[append|prepend|replace|optional]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      dotAll="[false|true]"
      unixLines="[false|true]"
      literal="[false|true]"
      comments="[false|true]"
      multiline="[false|true]"
      canonEq="[false|true]"
      unicodeCase="[false|true]"
      unicodeCharacterClass="[false|true]">
    (regular expression)
  </pattern>
</metadata>
<environment>
  <!-- repeat variable tag as needed -->
  <variable
      name="(environment variable name)">
    (environment variable value)
  </variable>
</environment>
<tempDir>
  (Optional directory where to store temporary files used
  by this class.)
</tempDir>
Consuming classes implementing IXMLConfigurable can use the XML save/load methods of this class to inherit the above configuration options (which they can support differently).
See also: ExternalTagger, ExternalTransformer, ExternalParser
Modifier and Type | Field and Description |
---|---|
static String | META_FORMAT_JSON |
static String | META_FORMAT_PROPERTIES |
static String | META_FORMAT_XML |
static String | TOKEN_INPUT |
static String | TOKEN_INPUT_META |
static String | TOKEN_OUTPUT |
static String | TOKEN_OUTPUT_META |
static String | TOKEN_REFERENCE |
Constructor and Description |
---|
ExternalHandler() |
Modifier and Type | Method and Description |
---|---|
void | addEnvironmentVariable(String name, String value): Adds an environment variable to the list of previously assigned variables (if any). |
void | addEnvironmentVariables(Map<String,String> environmentVariables): Adds the environment variables, keeping environment variables previously assigned. |
void | addMetadataExtractionPattern(String field, String pattern): Adds a metadata extraction pattern that will extract the whole text matched into the given field. |
void | addMetadataExtractionPattern(String field, String pattern, int valueGroup): Adds a metadata extraction pattern, which will extract the value from the specified group index upon matching. |
void | addMetadataExtractionPatterns(RegexFieldValueExtractor... patterns): Adds metadata extraction patterns that will extract matching field names/values. |
boolean | equals(Object other) |
String | getCommand(): Gets the command to execute. |
Map<String,String> | getEnvironmentVariables(): Gets environment variables. |
List<RegexFieldValueExtractor> | getMetadataExtractionPatterns(): Gets metadata extraction patterns. |
String | getMetadataInputFormat(): Gets the format of the metadata input file sent to the external application. |
String | getMetadataOutputFormat(): Gets the format of the metadata output file from the external application. |
PropertySetter | getOnSet(): Gets the property setter to use when a metadata value is set. |
Path | getTempDir(): Gets the directory where to store temporary files sent to the external handler as file paths. |
void | handleDocument(HandlerDoc doc, InputStream input, OutputStream output): Invokes the external application on a document. |
int | hashCode() |
void | loadHandlerFromXML(XML xml) |
void | saveHandlerToXML(XML xml) |
void | setCommand(String command): Sets the command to execute. |
void | setEnvironmentVariables(Map<String,String> environmentVariables): Sets the environment variables. |
void | setMetadataExtractionPatterns(RegexFieldValueExtractor... patterns): Sets metadata extraction patterns. |
void | setMetadataInputFormat(String metadataInputFormat): Sets the format of the metadata input file sent to the external application. |
void | setMetadataOutputFormat(String metadataOutputFormat): Sets the format of the metadata output file from the external application. |
void | setOnSet(PropertySetter onSet): Sets the property setter to use when a metadata value is set. |
void | setTempDir(Path tempDir): Sets the directory where to store temporary files sent to the external handler as file paths. |
String | toString() |
public static final String TOKEN_INPUT
public static final String TOKEN_OUTPUT
public static final String TOKEN_INPUT_META
public static final String TOKEN_OUTPUT_META
public static final String TOKEN_REFERENCE
public static final String META_FORMAT_JSON
public static final String META_FORMAT_XML
public static final String META_FORMAT_PROPERTIES
public String getCommand()
public void setCommand(String command)
command - the command
public Path getTempDir()
public void setTempDir(Path tempDir)
tempDir - temporary directory
public List<RegexFieldValueExtractor> getMetadataExtractionPatterns()
public void addMetadataExtractionPattern(String field, String pattern)
field - target field to store the matching pattern
pattern - the pattern
public void addMetadataExtractionPattern(String field, String pattern, int valueGroup)
field - target field to store the matching pattern
pattern - the pattern
valueGroup - which pattern group to return
public void addMetadataExtractionPatterns(RegexFieldValueExtractor... patterns)
patterns - extraction patterns
public void setMetadataExtractionPatterns(RegexFieldValueExtractor... patterns)
patterns - extraction patterns
public Map<String,String> getEnvironmentVariables()
Returns null if using the current process environment variables.
public void setEnvironmentVariables(Map<String,String> environmentVariables)
Set to null to use the current process environment variables (default).
environmentVariables - environment variables
public void addEnvironmentVariables(Map<String,String> environmentVariables)
null to ExternalTransformer.setEnvironmentVariables(Map).
environmentVariables - environment variables
public void addEnvironmentVariable(String name, String value)
A null name has no effect, while null values are converted to empty strings.
name - environment variable name
value - environment variable value
public String getMetadataInputFormat()
Only applicable when the ${INPUT_META} token is part of the command.
public void setMetadataInputFormat(String metadataInputFormat)
Only applicable when the ${INPUT_META} token is part of the command.
metadataInputFormat - format of the metadata input file
public String getMetadataOutputFormat()
Only applicable when the ${OUTPUT_META} token is part of the command.
public void setMetadataOutputFormat(String metadataOutputFormat)
Set to null to rely on metadata extraction patterns instead. Only applicable when the ${OUTPUT_META} token is part of the command.
metadataOutputFormat - format of the metadata output file
public PropertySetter getOnSet()
public void setOnSet(PropertySetter onSet)
onSet - property setter
public void handleDocument(HandlerDoc doc, InputStream input, OutputStream output) throws ImporterHandlerException
doc - document
input - document content
output - processed document output stream
ImporterHandlerException - failed to handle the document
public void loadHandlerFromXML(XML xml)
public void saveHandlerToXML(XML xml)
Copyright © 2009–2023 Norconex Inc. All rights reserved.