public class ExternalTransformer extends AbstractDocumentTransformer
Transforms a document using an external application.
This class relies on ExternalHandler
for most of the work.
Refer to ExternalHandler
for full documentation.
To parse/extract raw text from files, it is recommended to use
ExternalParser
instead.
<handler
class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
<!-- multiple "restrictTo" tags allowed (only one needs to match) -->
<restrictTo>
<fieldMatcher
method="[basic|csv|wildcard|regex]"
ignoreCase="[false|true]"
ignoreDiacritic="[false|true]"
partial="[false|true]">
(field-matching expression)
</fieldMatcher>
<valueMatcher
method="[basic|csv|wildcard|regex]"
ignoreCase="[false|true]"
ignoreDiacritic="[false|true]"
partial="[false|true]">
(value-matching expression)
</valueMatcher>
</restrictTo>
<command>
c:\Apps\myapp.exe ${INPUT} ${OUTPUT} ${INPUT_META} ${OUTPUT_META}
${REFERENCE}
</command>
<metadata
inputFormat="[json|xml|properties]"
outputFormat="[json|xml|properties]"
onSet="[append|prepend|replace|optional]">
<!--
Pattern only used when no output format is specified.
Repeat as needed.
-->
<pattern
toField="(toField name)"
fieldGroup="(toField name match group index)"
valueGroup="(value match group index)"
onSet="[append|prepend|replace|optional]"
ignoreCase="[false|true]"
ignoreDiacritic="[false|true]"
dotAll="[false|true]"
unixLines="[false|true]"
literal="[false|true]"
comments="[false|true]"
multiline="[false|true]"
canonEq="[false|true]"
unicodeCase="[false|true]"
unicodeCharacterClass="[false|true]">
(regular expression)
</pattern>
</metadata>
<environment>
<!-- repeat variable tag as needed -->
<variable
name="(environment variable name)">
(environment variable value)
</variable>
</environment>
<tempDir>
(Optional directory where to store temporary files used
for transformation.)
</tempDir>
</handler>
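The ${INPUT}, ${OUTPUT}, and related tokens in the command act as placeholders for values supplied at run time. The following minimal Java sketch illustrates plain token expansion only. It is not how ExternalHandler is implemented, and the class name CommandTokenSketch, the method expand, and the file paths are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of the Norconex Importer API.
final class CommandTokenSketch {

    // Replaces each ${NAME} token in the command with its mapped value.
    // Tokens without a mapping are left untouched.
    static String expand(String command, Map<String, String> tokens) {
        String result = command;
        for (Map.Entry<String, String> entry : tokens.entrySet()) {
            // String.replace performs literal (non-regex) replacement.
            result = result.replace("${" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> tokens = new LinkedHashMap<>();
        tokens.put("INPUT", "/tmp/in.txt");   // assumed input file path
        tokens.put("OUTPUT", "/tmp/out.txt"); // assumed output file path
        // prints: /path/transform/app /tmp/in.txt /tmp/out.txt
        System.out.println(expand("/path/transform/app ${INPUT} ${OUTPUT}", tokens));
    }
}
```

Because the closing brace is part of each token, expanding ${INPUT} leaves ${INPUT_META} intact in this sketch.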
<handler
class="ExternalTransformer">
<command>/path/transform/app ${INPUT} ${OUTPUT}</command>
<metadata>
<pattern
toField="docnumber"
valueGroup="1">
DocNo:(\d+)
</pattern>
</metadata>
</handler>
The above example invokes an external application that accepts two files as arguments: the first is the file to transform, and the second holds the transformation result. It also extracts a document number from STDOUT, found as "DocNo:1234", and stores it as "docnumber".
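To make the extraction step concrete, here is a minimal sketch using only java.util.regex. It applies the same expression as the example pattern to some sample STDOUT text and returns match group 1, which corresponds to valueGroup="1". The class DocNoExtractionSketch, its method, and the sample output are hypothetical and are not part of the Norconex API.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the Norconex Importer API.
final class DocNoExtractionSketch {

    // Same expression as in the XML example: group 1 captures the digits.
    private static final Pattern DOC_NO = Pattern.compile("DocNo:(\\d+)");

    // Returns the value of match group 1 for the first match, if any.
    static Optional<String> extractDocNumber(String stdout) {
        Matcher matcher = DOC_NO.matcher(stdout);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String stdout = "Converting file...\nDocNo:1234\nDone."; // assumed sample output
        // prints: 1234
        System.out.println(extractDocNumber(stdout).orElse("(none)"));
    }
}
```

Without a value group, the whole matched text ("DocNo:1234") would be the extracted value, as described for the two-argument addMetadataExtractionPattern method.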
See Also: ExternalHandler
Constructor and Description |
---|
ExternalTransformer() |
Modifier and Type | Method and Description |
---|---|
void | addEnvironmentVariable(String name, String value): Adds an environment variable to the list of previously assigned variables (if any). |
void | addEnvironmentVariables(Map<String,String> environmentVariables): Adds the environment variables, keeping environment variables previously assigned. |
void | addMetadataExtractionPattern(String field, String pattern): Adds a metadata extraction pattern that will extract the whole text matched into the given field. |
void | addMetadataExtractionPattern(String field, String pattern, int valueGroup): Adds a metadata extraction pattern, which will extract the value from the specified group index upon matching. |
void | addMetadataExtractionPatterns(RegexFieldValueExtractor... patterns): Adds metadata extraction patterns that will extract matching field names/values. |
boolean | equals(Object other) |
String | getCommand(): Gets the command to execute. |
Map<String,String> | getEnvironmentVariables(): Gets environment variables. |
List<RegexFieldValueExtractor> | getMetadataExtractionPatterns(): Gets metadata extraction patterns. |
String | getMetadataInputFormat(): Gets the format of the metadata input file sent to the external application. |
String | getMetadataOutputFormat(): Gets the format of the metadata output file from the external application. |
PropertySetter | getOnSet(): Gets the property setter to use when a metadata value is set. |
Path | getTempDir(): Gets directory where to store temporary files used for transformation. |
int | hashCode() |
protected void | loadHandlerFromXML(XML xml): Loads configuration settings specific to the implementing class. |
protected void | saveHandlerToXML(XML xml): Saves configuration settings specific to the implementing class. |
void | setCommand(String command): Sets the command to execute. |
void | setEnvironmentVariables(Map<String,String> environmentVariables): Sets the environment variables. |
void | setMetadataExtractionPatterns(RegexFieldValueExtractor... patterns): Sets metadata extraction patterns. |
void | setMetadataInputFormat(String metadataInputFormat): Sets the format of the metadata input file sent to the external application. |
void | setMetadataOutputFormat(String metadataOutputFormat): Sets the format of the metadata output file from the external application. |
void | setOnSet(PropertySetter onSet): Sets the property setter to use when a metadata value is set. |
void | setTempDir(Path tempDir): Sets directory where to store temporary files used for transformation. |
String | toString() |
protected void | transformApplicableDocument(HandlerDoc doc, InputStream input, OutputStream output, ParseState parseState) |
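Methods inherited from class AbstractDocumentTransformer: transformDocument
Methods inherited from class AbstractImporterHandler: addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML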
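public String getCommand()
Gets the command to execute.
public void setCommand(String command)
Sets the command to execute.
Parameters:
command - the command
public List<RegexFieldValueExtractor> getMetadataExtractionPatterns()
Gets metadata extraction patterns.
public void addMetadataExtractionPattern(String field, String pattern)
Adds a metadata extraction pattern that will extract the whole text matched into the given field.
Parameters:
field - target field to store the matching pattern
pattern - the pattern
public void addMetadataExtractionPattern(String field, String pattern, int valueGroup)
Adds a metadata extraction pattern, which will extract the value from the specified group index upon matching.
Parameters:
field - target field to store the matching pattern
pattern - the pattern
valueGroup - which pattern group to return
public void addMetadataExtractionPatterns(RegexFieldValueExtractor... patterns)
Adds metadata extraction patterns that will extract matching field names/values.
Parameters:
patterns - extraction patterns
public void setMetadataExtractionPatterns(RegexFieldValueExtractor... patterns)
Sets metadata extraction patterns.
Parameters:
patterns - extraction patterns
public Map<String,String> getEnvironmentVariables()
Gets environment variables.
Returns:
the environment variables, or null if using the current process environment variables
public void setEnvironmentVariables(Map<String,String> environmentVariables)
Sets the environment variables. Pass null to use the current process environment variables (default).
Parameters:
environmentVariables - environment variables
public void addEnvironmentVariables(Map<String,String> environmentVariables)
Adds the environment variables, keeping environment variables previously assigned. To use the current process environment variables instead, pass null to setEnvironmentVariables(Map).
Parameters:
environmentVariables - environment variables
public void addEnvironmentVariable(String name, String value)
Adds an environment variable to the list of previously assigned variables (if any). A null name has no effect, while null values are converted to empty strings.
Parameters:
name - environment variable name
value - environment variable value
public String getMetadataInputFormat()
Gets the format of the metadata input file sent to the external application. Only applicable when the ${INPUT} token is part of the command.
public void setMetadataInputFormat(String metadataInputFormat)
Sets the format of the metadata input file sent to the external application. Only applicable when the ${INPUT} token is part of the command.
Parameters:
metadataInputFormat - format of the metadata input file
public String getMetadataOutputFormat()
Gets the format of the metadata output file from the external application. Only applicable when the ${OUTPUT} token is part of the command.
public void setMetadataOutputFormat(String metadataOutputFormat)
Sets the format of the metadata output file from the external application. Set to null to rely on metadata extraction patterns instead. Only applicable when the ${OUTPUT} token is part of the command.
Parameters:
metadataOutputFormat - format of the metadata output file
public PropertySetter getOnSet()
Gets the property setter to use when a metadata value is set.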
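When the metadata output format is "properties", the metadata written by the external application has to be read back as key/value pairs. The sketch below shows one way to read such content, assuming it follows the standard java.util.Properties key=value syntax. That layout is an assumption here (refer to ExternalHandler for the layout actually expected), and the names MetadataPropertiesSketch and parse are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration; not part of the Norconex Importer API.
final class MetadataPropertiesSketch {

    // Parses key=value lines using the standard Java properties syntax
    // (assumed layout of the metadata output).
    static Properties parse(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties meta = parse("docnumber=1234\ntitle=Sample"); // assumed sample content
        // prints: 1234
        System.out.println(meta.getProperty("docnumber"));
    }
}
```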
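public void setOnSet(PropertySetter onSet)
Sets the property setter to use when a metadata value is set.
Parameters:
onSet - property setter
public Path getTempDir()
Gets directory where to store temporary files used for transformation.
public void setTempDir(Path tempDir)
Sets directory where to store temporary files used for transformation.
Parameters:
tempDir - temporary directory
protected void transformApplicableDocument(HandlerDoc doc, InputStream input, OutputStream output, ParseState parseState) throws ImporterHandlerException
Specified by:
transformApplicableDocument in class AbstractDocumentTransformer
Throws:
ImporterHandlerException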
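protected void loadHandlerFromXML(XML xml)
Description copied from class: AbstractImporterHandler
Loads configuration settings specific to the implementing class.
Specified by:
loadHandlerFromXML in class AbstractImporterHandler
Parameters:
xml - XML configuration
protected void saveHandlerToXML(XML xml)
Description copied from class: AbstractImporterHandler
Saves configuration settings specific to the implementing class.
Specified by:
saveHandlerToXML in class AbstractImporterHandler
Parameters:
xml - the XML
public boolean equals(Object other)
Overrides:
equals in class AbstractImporterHandler
public int hashCode()
Overrides:
hashCode in class AbstractImporterHandler
public String toString()
Overrides:
toString in class AbstractImporterHandler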
Copyright © 2009–2023 Norconex Inc. All rights reserved.