public class ExternalTransformer extends AbstractDocumentTransformer implements IXMLConfigurable
Transforms document content using an external application.
Since 2.8.0, the document metadata and reference can also be passed to the external application, and new metadata can be returned. Version 2.8.0 also makes regular expression matching of metadata fields more flexible.
Since 2.8.0, regular expression case sensitivity can also be set for each pattern.
Since 2.8.0, match group indexes can be specified to extract field names and values using the same regular expression. This is done by using match groups (parentheses) in your regular expressions. For each pattern you define, you can specify which match group holds the field name and which one holds the value. Specifying a field match group is optional if a field is provided. If no match groups are specified, a field is expected.
When constructing the command to launch the external application, this transformer looks for specific tokens to replace with file path arguments (in addition to any other arguments you may have). The paths are created by this transformer. They are case-sensitive, and the files they represent are temporary (they are deleted after the transformation). It is possible to omit one or more tokens to use standard streams instead, where applicable. These tokens are:
${INPUT}
${INPUT_META}
${OUTPUT}
${OUTPUT_META}
${REFERENCE}
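The token replacement described above can be pictured with a minimal sketch. It is not the transformer's actual implementation: the `CommandTokens` class, its `resolve` method, and the sample paths are hypothetical and only illustrate literal substitution of tokens found in a command. Tokens are treated here as case-sensitive, which is an assumption of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of the Norconex API) showing how tokens
// in a command string could be replaced by file paths.
final class CommandTokens {

    // Replaces each "${NAME}" token present in the command with its value.
    // Tokens absent from the command are simply not substituted.
    static String resolve(String command, Map<String, String> values) {
        String resolved = command;
        for (Map.Entry<String, String> e : values.entrySet()) {
            // String.replace performs a literal, case-sensitive replacement.
            resolved = resolved.replace("${" + e.getKey() + "}", e.getValue());
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("INPUT", "/tmp/in.tmp");       // placeholder path
        values.put("OUTPUT", "/tmp/out.tmp");     // placeholder path
        System.out.println(
                resolve("/path/transform/app ${INPUT} ${OUTPUT}", values));
    }
}
```

Because the closing brace is part of each token, `${INPUT}` and `${INPUT_META}` never collide in this sketch.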
If ${INPUT_META} is part of the command, metadata can be provided to the external application in JSON (default), XML, or Properties format. Those formats can also be used if ${OUTPUT_META} is part of the command. The formats are:
JSON. Example:
{
  "field1" : [ "value1a", "value1b", "value1c" ],
  "field2" : [ "value2" ],
  "field3" : [ "value3a", "value3b" ]
}
Java Properties XML file format, with the exception that metadata with multiple values are supported, and will have their values saved on different lines (repeating the key). Example:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <comment>My Comment</comment>
  <entry key="field1">value1a</entry>
  <entry key="field1">value1b</entry>
  <entry key="field1">value1c</entry>
  <entry key="field2">value2</entry>
  <entry key="field3">value3a</entry>
  <entry key="field3">value3b</entry>
</properties>
Java Properties standard file format, with the exception that metadata with multiple values are supported, and will have their values saved on different lines (repeating the key). Refer to Java Properties.load(java.io.Reader) for syntax information. Example:
# My Comment
field1 = value1a
field1 = value1b
field1 = value1c
field2 = value2
field3 = value3a
field3 = value3b
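Since keys may repeat in the Properties format, an external application that loads the file with `java.util.Properties` would keep only the last value for each key. The following is a minimal sketch of a reader that preserves all values. The `MultiValueProperties` class is hypothetical, and it only handles simple `key = value` lines, not the escapes and line continuations of the full Properties syntax.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader (not part of the Norconex API) that keeps every
// value of a repeated key, in the order the lines appear.
final class MultiValueProperties {

    static Map<String, List<String>> parse(String text) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            // Skip blank lines and comment lines.
            if (trimmed.isEmpty()
                    || trimmed.startsWith("#") || trimmed.startsWith("!")) {
                continue;
            }
            int sep = trimmed.indexOf('=');
            if (sep < 0) {
                continue; // not a simple key=value line; ignored in this sketch
            }
            String key = trimmed.substring(0, sep).trim();
            String value = trimmed.substring(sep + 1).trim();
            // Append instead of overwrite so repeated keys accumulate.
            fields.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
        return fields;
    }
}
```

A `LinkedHashMap` is used so that fields come back in the order they were written, which makes the result easier to compare with the file.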
It is possible to specify metadata extraction patterns that will be applied either to the returned metadata file or to the standard output and error streams. If ${OUTPUT_META} is found in the command, the output format will be used to parse the outgoing metadata file. Set the format to null to rely on extraction patterns for parsing the output file. When ${OUTPUT_META} is omitted, extraction patterns will be applied to the external application's standard output and standard error streams. If there is no ${OUTPUT_META} token and no metadata extraction patterns are defined, it is assumed the external application did not produce any new metadata.
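The rules above amount to a small decision table. The sketch below restates them in code under stated assumptions: `MetadataSourceDecision`, its `Source` enum, and the `decide` method are hypothetical names, and the case of an ${OUTPUT_META} token with a null format and no patterns is not described in the text, so the sketch simply reports pattern-based parsing for it.

```java
// Hypothetical decision helper; names are illustrative and not part of
// the Norconex API.
final class MetadataSourceDecision {

    enum Source {
        OUTPUT_FILE_PARSED_BY_FORMAT,
        OUTPUT_FILE_PARSED_BY_PATTERNS,
        STANDARD_STREAMS_PARSED_BY_PATTERNS,
        NO_NEW_METADATA
    }

    static Source decide(
            String command, String outputFormat, boolean hasPatterns) {
        if (command.contains("${OUTPUT_META}")) {
            // A metadata file is produced: parse it with the configured
            // format, or fall back to patterns when the format is null.
            return outputFormat != null
                    ? Source.OUTPUT_FILE_PARSED_BY_FORMAT
                    : Source.OUTPUT_FILE_PARSED_BY_PATTERNS;
        }
        // No metadata file: patterns, if any, apply to STDOUT and STDERR.
        return hasPatterns
                ? Source.STANDARD_STREAMS_PARSED_BY_PATTERNS
                : Source.NO_NEW_METADATA;
    }
}
```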
When using metadata extraction patterns with standard streams, each pattern is applied to each line returned from STDOUT and STDERR. A metadata field name can be supplied with each pattern. If the pattern does not contain any match group, the entire matched expression will be used as the metadata field value.
Field names and values can be obtained by using the same regular expression. This is done by using match groups (parentheses) in your regular expressions. For each pattern you define, you can specify which match group holds the field name and which one holds the value. Specifying a field match group is optional if a field is provided. If no match groups are specified, a field is expected.
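To illustrate how one regular expression can supply both the field name and the value, here is a minimal sketch using only `java.util.regex`. It does not use or reproduce RegexFieldExtractor: the `GroupExtraction` class is hypothetical, and the convention that a negative `fieldGroup` means "use the fixed field name" is an assumption made for this sketch only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the Norconex API) applying one
// pattern to one line of output.
final class GroupExtraction {

    // fieldGroup < 0: use the supplied field name (sketch convention).
    // valueGroup 0: use the entire matched expression as the value.
    static void extract(String line, Pattern pattern, String field,
            int fieldGroup, int valueGroup,
            Map<String, List<String>> target) {
        Matcher m = pattern.matcher(line);
        while (m.find()) {
            String name = fieldGroup >= 0 ? m.group(fieldGroup) : field;
            String value = m.group(valueGroup);
            if (name != null && value != null) {
                target.computeIfAbsent(name, k -> new ArrayList<>())
                        .add(value);
            }
        }
    }
}
```

For instance, calling `extract` with the line `DocNo:1234`, the pattern `DocNo:(\d+)`, the field `docnumber`, a negative `fieldGroup`, and a `valueGroup` of 1 stores `1234` under `docnumber`.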
Execution environment variables can be set to replace environment variables defined for the current process.
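As a rough illustration of what replacing the environment means, the sketch below configures a standard `ProcessBuilder`. It is not the transformer's implementation: the `EnvironmentSetup` class is hypothetical, and it assumes that a null map means "keep the current process environment" while a non-null map fully replaces it.

```java
import java.util.Map;

// Hypothetical helper (not part of the Norconex API) applying a set of
// environment variables to a process about to be launched.
final class EnvironmentSetup {

    static ProcessBuilder configure(
            ProcessBuilder builder, Map<String, String> variables) {
        if (variables != null) {
            Map<String, String> env = builder.environment();
            env.clear();           // drop variables inherited from this process
            env.putAll(variables); // use only the supplied variables
        }
        // When variables is null, the inherited environment is left as-is.
        return builder;
    }
}
```

Note that `ProcessBuilder.environment()` does not accept null keys or values, so a caller would need to filter or convert them beforehand.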
To extract raw text from files, it is recommended to use an ExternalParser instead.
<transformer class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
    <restrictTo caseSensitive="[false|true]"
            field="(name of header/metadata field name to match)">
        (regular expression of value to match)
    </restrictTo>
    <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
    <command>
        c:\Apps\myapp.exe ${INPUT} ${OUTPUT} ${INPUT_META} ${OUTPUT_META} ${REFERENCE}
    </command>
    <metadata inputFormat="[json|xml|properties]"
            outputFormat="[json|xml|properties]">
        <!-- pattern only used when no output format is specified -->
        <pattern field="(target field name)"
                fieldGroup="(field name match group index)"
                valueGroup="(field value match group index)"
                caseSensitive="[false|true]">
            (regular expression)
        </pattern>
        <!-- repeat pattern tag as needed -->
    </metadata>
    <environment>
        <variable name="(environment variable name)">
            (environment variable value)
        </variable>
        <!-- repeat variable tag as needed -->
    </environment>
    <tempDir>
        (Optional directory where to store temporary files used for transformation.)
    </tempDir>
</transformer>
The following example invokes an external application that accepts two files as arguments: the first one is the file to transform, and the second one holds the transformation result. It also extracts a document number from STDOUT, found as "DocNo:1234", and stores it as "docnumber".
<transformer class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
    <command>/path/transform/app ${INPUT} ${OUTPUT}</command>
    <metadata>
        <pattern field="docnumber" valueGroup="1">DocNo:(\d+)</pattern>
    </metadata>
</transformer>
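For context, an external application compatible with this example could look like the following sketch. Everything in it is hypothetical: the `SampleTransformApp` name, the upper-casing "transformation", and the hard-coded document number are placeholders chosen only to show the contract (read the first path, write the second, report metadata on STDOUT).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical external application: args[0] is the file to transform,
// args[1] is the file that receives the transformation result.
final class SampleTransformApp {

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: SampleTransformApp <input> <output>");
            return;
        }
        Path input = Paths.get(args[0]);
        Path output = Paths.get(args[1]);

        // Placeholder transformation: upper-case the UTF-8 text content.
        String content =
                new String(Files.readAllBytes(input), StandardCharsets.UTF_8);
        Files.write(output,
                content.toUpperCase(Locale.ROOT)
                        .getBytes(StandardCharsets.UTF_8));

        // Line intended to be picked up by the DocNo:(\d+) pattern.
        System.out.println("DocNo:1234");
    }
}
```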
See Also: ExternalParser
Modifier and Type | Field and Description |
---|---|
static String | META_FORMAT_JSON |
static String | META_FORMAT_PROPERTIES |
static String | META_FORMAT_XML |
static String | REVERSE_FLAG - Deprecated. Since 2.8.0, specify field name and value match groups instead. |
static String | TOKEN_INPUT |
static String | TOKEN_INPUT_META |
static String | TOKEN_OUTPUT |
static String | TOKEN_OUTPUT_META |
static String | TOKEN_REFERENCE |
Constructor and Description |
---|
ExternalTransformer() |
Modifier and Type | Method and Description |
---|---|
void | addEnvironmentVariable(String name, String value) - Adds an environment variable to the list of previously assigned variables (if any). |
void | addEnvironmentVariables(Map<String,String> environmentVariables) - Adds the environment variables, keeping environment variables previously assigned. |
void | addMetadataExtractionPattern(Pattern pattern, boolean reverse) - Deprecated. Since 2.8.0, use addMetadataExtractionPatterns(RegexFieldExtractor...) |
void | addMetadataExtractionPattern(Pattern pattern, String field) - Deprecated. Since 2.8.0, use addMetadataExtractionPatterns(RegexFieldExtractor...) |
void | addMetadataExtractionPattern(String field, String pattern) - Adds a metadata extraction pattern that will extract the whole text matched into the given field. |
void | addMetadataExtractionPattern(String field, String pattern, int valueGroup) - Adds a metadata extraction pattern, which will extract the value from the specified group index upon matching. |
void | addMetadataExtractionPatterns(Map<Pattern,String> metaPatterns) - Deprecated. Since 2.8.0, use addMetadataExtractionPatterns(RegexFieldExtractor...) |
void | addMetadataExtractionPatterns(RegexFieldExtractor... patterns) - Adds metadata extraction patterns that will extract matching field names/values. |
boolean | equals(Object other) |
String | getCommand() - Gets the command to execute. |
Map<String,String> | getEnvironmentVariables() - Gets environment variables. |
List<RegexFieldExtractor> | getMetadataExtractionPatterns() - Gets metadata extraction patterns. |
String | getMetadataInputFormat() - Gets the format of the metadata input file sent to the external application. |
String | getMetadataOutputFormat() - Gets the format of the metadata output file from the external application. |
File | getTempDir() - Gets directory where to store temporary files used for transformation. |
int | hashCode() |
protected void | loadHandlerFromXML(org.apache.commons.configuration.XMLConfiguration xml) - Loads configuration settings specific to the implementing class. |
protected void | saveHandlerToXML(EnhancedXMLStreamWriter writer) - Saves configuration settings specific to the implementing class. |
void | setCommand(String command) - Sets the command to execute. |
void | setEnvironmentVariables(Map<String,String> environmentVariables) - Sets the environment variables. |
void | setMetadataExtractionPatterns(Map<Pattern,String> metaPatterns) - Deprecated. Since 2.8.0, use addMetadataExtractionPatterns(RegexFieldExtractor...) |
void | setMetadataExtractionPatterns(RegexFieldExtractor... patterns) - Sets metadata extraction patterns. |
void | setMetadataInputFormat(String metadataInputFormat) - Sets the format of the metadata input file sent to the external application. |
void | setMetadataOutputFormat(String metadataOutputFormat) - Sets the format of the metadata output file from the external application. |
void | setTempDir(File tempDir) - Sets directory where to store temporary files used for transformation. |
String | toString() |
protected void | transformApplicableDocument(String reference, InputStream input, OutputStream output, ImporterMetadata metadata, boolean parsed) |
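Methods inherited from class AbstractDocumentTransformer: transformDocument
Methods inherited from class AbstractImporterHandler: addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
Methods inherited from class java.lang.Object: clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Methods inherited from interface IXMLConfigurable: loadFromXML, saveToXML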
public static final String TOKEN_INPUT
public static final String TOKEN_OUTPUT
public static final String TOKEN_INPUT_META
public static final String TOKEN_OUTPUT_META
public static final String TOKEN_REFERENCE
public static final String META_FORMAT_JSON
public static final String META_FORMAT_XML
public static final String META_FORMAT_PROPERTIES
@Deprecated public static final String REVERSE_FLAG
public String getCommand()
public void setCommand(String command)
command - the command
public File getTempDir()
public void setTempDir(File tempDir)
tempDir - temporary directory
public List<RegexFieldExtractor> getMetadataExtractionPatterns()
@Deprecated public void setMetadataExtractionPatterns(Map<Pattern,String> metaPatterns)
Deprecated. Since 2.8.0, use addMetadataExtractionPatterns(RegexFieldExtractor...)
metaPatterns - map of patterns and field names
@Deprecated public void addMetadataExtractionPatterns(Map<Pattern,String> metaPatterns)
Deprecated. Since 2.8.0, use addMetadataExtractionPatterns(RegexFieldExtractor...)
metaPatterns - map of patterns and field names
@Deprecated public void addMetadataExtractionPattern(Pattern pattern, boolean reverse)
Deprecated. Since 2.8.0, use addMetadataExtractionPatterns(RegexFieldExtractor...)
pattern - pattern with two match groups
reverse - whether to reverse match groups (inverse key and value).
@Deprecated public void addMetadataExtractionPattern(Pattern pattern, String field)
Deprecated. Since 2.8.0, use addMetadataExtractionPatterns(RegexFieldExtractor...)
pattern - pattern with no or one match group
field - field name where to store the matched pattern
public void addMetadataExtractionPattern(String field, String pattern)
field - target field to store the matching pattern.
pattern - the pattern
public void addMetadataExtractionPattern(String field, String pattern, int valueGroup)
field - target field to store the matching pattern.
pattern - the pattern
valueGroup - which pattern group to return.
public void addMetadataExtractionPatterns(RegexFieldExtractor... patterns)
patterns - extraction pattern
public void setMetadataExtractionPatterns(RegexFieldExtractor... patterns)
patterns - extraction pattern
public Map<String,String> getEnvironmentVariables()
Returns null if using the current process environment variables.
public void setEnvironmentVariables(Map<String,String> environmentVariables)
Set to null to use the current process environment variables (default).
environmentVariables - environment variables
public void addEnvironmentVariables(Map<String,String> environmentVariables)
To use the current process environment variables instead, pass null to setEnvironmentVariables(Map).
environmentVariables - environment variables
public void addEnvironmentVariable(String name, String value)
A null name has no effect, while null values are converted to empty strings.
name - environment variable name
value - environment variable value
public String getMetadataInputFormat()
Only applicable when the ${INPUT_META} token is part of the command.
public void setMetadataInputFormat(String metadataInputFormat)
Only applicable when the ${INPUT_META} token is part of the command.
metadataInputFormat - format of the metadata input file
public String getMetadataOutputFormat()
Only applicable when the ${OUTPUT_META} token is part of the command.
public void setMetadataOutputFormat(String metadataOutputFormat)
Set to null to rely on metadata extraction patterns instead. Only applicable when the ${OUTPUT_META} token is part of the command.
metadataOutputFormat - format of the metadata output file
protected void transformApplicableDocument(String reference, InputStream input, OutputStream output, ImporterMetadata metadata, boolean parsed) throws ImporterHandlerException
transformApplicableDocument in class AbstractDocumentTransformer
Throws: ImporterHandlerException
protected void loadHandlerFromXML(org.apache.commons.configuration.XMLConfiguration xml) throws IOException
Description copied from class: AbstractImporterHandler
loadHandlerFromXML in class AbstractImporterHandler
xml - xml configuration
IOException - could not load from XML
protected void saveHandlerToXML(EnhancedXMLStreamWriter writer) throws XMLStreamException
Description copied from class: AbstractImporterHandler
saveHandlerToXML in class AbstractImporterHandler
writer - the xml writer
XMLStreamException - could not save to XML
public boolean equals(Object other)
equals in class AbstractImporterHandler
public int hashCode()
hashCode in class AbstractImporterHandler
public String toString()
toString in class AbstractImporterHandler
Copyright © 2009–2021 Norconex Inc. All rights reserved.