public class ExternalTagger extends AbstractDocumentTagger
Extracts metadata from a document using an external application.
This class relies on ExternalHandler for most of the work.
Refer to ExternalHandler for full documentation, except for
the following differences:
- There is no ${OUTPUT} token (since taggers do not modify content).
- Sending the document content to the external application can be disabled with setInputDisabled(boolean).
To use an external application to change a file's content, consider using ExternalTransformer instead.
To parse/extract raw text from files, it is recommended to use ExternalParser instead.
<handler
    class="com.norconex.importer.handler.tagger.impl.ExternalTagger">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  <command
      inputDisabled="[false|true]">
    c:\Apps\myapp.exe ${INPUT} ${INPUT_META} ${OUTPUT_META} ${REFERENCE}
  </command>
  <metadata
      inputFormat="[json|xml|properties]"
      outputFormat="[json|xml|properties]"
      onSet="[append|prepend|replace|optional]">
    <!--
      Pattern only used when no output format is specified. Repeat as needed.
      -->
    <pattern
        toField="(toField name)"
        fieldGroup="(toField name match group index)"
        valueGroup="(value match group index)"
        onSet="[append|prepend|replace|optional]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        dotAll="[false|true]"
        unixLines="[false|true]"
        literal="[false|true]"
        comments="[false|true]"
        multiline="[false|true]"
        canonEq="[false|true]"
        unicodeCase="[false|true]"
        unicodeCharacterClass="[false|true]">
      (regular expression)
    </pattern>
  </metadata>
  <environment>
    <!-- repeat variable tag as needed -->
    <variable
        name="(environment variable name)">
      (environment variable value)
    </variable>
  </environment>
  <tempDir>
    (Optional directory where to store temporary files used
    for transformation.)
  </tempDir>
</handler>
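The pattern element in the configuration above pairs a regular expression with a fieldGroup and a valueGroup index. The following is a minimal JDK-only sketch of that idea, not the library's implementation: the class PatternGroupSketch and its extract helper are hypothetical names, and the sample application output is made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternGroupSketch {

    // Hypothetical helper: collects field/value pairs from text, using one
    // match group for the field name and another for the value.
    static Map<String, List<String>> extract(
            String text, String regex, int fieldGroup, int valueGroup) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            fields.computeIfAbsent(m.group(fieldGroup), k -> new ArrayList<>())
                  .add(m.group(valueGroup));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up output of an external application, one "name=value" per line.
        String output = "title=Sample\nauthor=Jane\nauthor=John\n";
        // Group 1 is the field name, group 2 is the value.
        System.out.println(extract(output, "(\\w+)=(.*)", 1, 2));
        // prints {title=[Sample], author=[Jane, John]}
    }
}
```

Repeated field names accumulate values here; how the handler itself merges repeated values is governed by its onSet setting.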
<handler
    class="ExternalTagger">
  <command>/path/tag/app ${INPUT} ${OUTPUT_META}</command>
</handler>
The above example invokes an external application that accepts a document as input and writes a file containing the new metadata.
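To make the role of the command tokens concrete, here is a small JDK-only sketch of token substitution. It assumes each token is replaced by a file path before the command runs, which this page does not spell out; CommandTokenSketch, expand, and the temporary file paths are hypothetical and not part of the library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CommandTokenSketch {

    // Hypothetical helper: replaces each ${NAME} token found in the template
    // with the matching value; tokens without a value are left untouched.
    static String expand(String template, Map<String, String> tokens) {
        String command = template;
        for (Map.Entry<String, String> e : tokens.entrySet()) {
            command = command.replace("${" + e.getKey() + "}", e.getValue());
        }
        return command;
    }

    public static void main(String[] args) {
        Map<String, String> tokens = new LinkedHashMap<>();
        tokens.put("INPUT", "/tmp/doc-in.tmp");          // assumed temp file path
        tokens.put("OUTPUT_META", "/tmp/meta-out.tmp");  // assumed temp file path
        System.out.println(
                expand("/path/tag/app ${INPUT} ${OUTPUT_META}", tokens));
        // prints /path/tag/app /tmp/doc-in.tmp /tmp/meta-out.tmp
    }
}
```

Because the closing brace is part of each token, replacing ${INPUT} does not affect ${INPUT_META}.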
See Also: ExternalHandler
Constructor and Description |
---|
ExternalTagger() |
Modifier and Type | Method and Description |
---|---|
void | addEnvironmentVariable(String name, String value): Adds an environment variable to the list of previously assigned variables (if any). |
void | addEnvironmentVariables(Map<String,String> environmentVariables): Adds the environment variables, keeping environment variables previously assigned. |
void | addMetadataExtractionPattern(String field, String pattern): Adds a metadata extraction pattern that will extract the whole text matched into the given field. |
void | addMetadataExtractionPattern(String field, String pattern, int valueGroup): Adds a metadata extraction pattern, which will extract the value from the specified group index upon matching. |
void | addMetadataExtractionPatterns(RegexFieldValueExtractor... patterns): Adds metadata extraction patterns that will extract matching field names/values. |
boolean | equals(Object other) |
String | getCommand(): Gets the command to execute. |
Map<String,String> | getEnvironmentVariables(): Gets environment variables. |
List<RegexFieldValueExtractor> | getMetadataExtractionPatterns(): Gets metadata extraction patterns. |
String | getMetadataInputFormat(): Gets the format of the metadata input file sent to the external application. |
String | getMetadataOutputFormat(): Gets the format of the metadata output file from the external application. |
PropertySetter | getOnSet(): Gets the property setter to use when a metadata value is set. |
Path | getTempDir(): Gets the directory where temporary files used for transformation are stored. |
int | hashCode() |
boolean | isInputDisabled(): Gets whether to send the document content or not, regardless of whether the ${INPUT} token is part of the command. |
protected void | loadHandlerFromXML(XML xml): Loads configuration settings specific to the implementing class. |
protected void | saveHandlerToXML(XML xml): Saves configuration settings specific to the implementing class. |
void | setCommand(String command): Sets the command to execute. |
void | setEnvironmentVariables(Map<String,String> environmentVariables): Sets the environment variables. |
void | setInputDisabled(boolean inputDisabled): Sets whether to send the document content or not, regardless of whether the ${INPUT} token is part of the command. |
void | setMetadataExtractionPatterns(RegexFieldValueExtractor... patterns): Sets metadata extraction patterns. |
void | setMetadataInputFormat(String metadataInputFormat): Sets the format of the metadata input file sent to the external application. |
void | setMetadataOutputFormat(String metadataOutputFormat): Sets the format of the metadata output file from the external application. |
void | setOnSet(PropertySetter onSet): Sets the property setter to use when a metadata value is set. |
void | setTempDir(Path tempDir): Sets the directory where temporary files used for transformation are stored. |
void | tagApplicableDocument(HandlerDoc doc, InputStream document, ParseState parseState) |
String | toString() |
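Methods inherited from class AbstractDocumentTagger: tagDocument
Methods inherited from class AbstractImporterHandler: addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML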
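The environment variable methods on this page document a few null-handling rules: a null map means the current process environment is used, a null name has no effect, and null values become empty strings. The sketch below mirrors those rules with JDK classes only. EnvironmentSketch and its method names are hypothetical stand-ins, not the handler's actual code.

```java
import java.util.Map;
import java.util.TreeMap;

public class EnvironmentSketch {

    // null means "use the current process environment" in this sketch.
    private Map<String, String> variables;

    // Replaces any previously assigned variables; null resets to the default.
    void setVariables(Map<String, String> vars) {
        variables = (vars == null) ? null : new TreeMap<>(vars);
    }

    // Adds one variable: a null name is ignored, a null value becomes "".
    void addVariable(String name, String value) {
        if (name == null) {
            return;
        }
        if (variables == null) {
            variables = new TreeMap<>();
        }
        variables.put(name, value == null ? "" : value);
    }

    Map<String, String> getVariables() {
        return variables;
    }

    public static void main(String[] args) {
        EnvironmentSketch env = new EnvironmentSketch();
        env.addVariable("APP_HOME", "/opt/app");
        env.addVariable(null, "ignored");  // no effect
        env.addVariable("EMPTY", null);    // stored as an empty string
        System.out.println(env.getVariables());
        // prints {APP_HOME=/opt/app, EMPTY=}
    }
}
```

A sorted map is used only to keep the printed order predictable.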
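
public boolean isInputDisabled()
Gets whether to send the document content or not, regardless of whether the ${INPUT} token is part of the command.
Returns:
true to prevent sending the input content

public void setInputDisabled(boolean inputDisabled)
Sets whether to send the document content or not, regardless of whether the ${INPUT} token is part of the command.
Parameters:
inputDisabled - true to prevent sending the input content

public String getCommand()
Gets the command to execute.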
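
public void setCommand(String command)
Sets the command to execute.
Parameters:
command - the command

public List<RegexFieldValueExtractor> getMetadataExtractionPatterns()
Gets metadata extraction patterns.

public void addMetadataExtractionPattern(String field, String pattern)
Adds a metadata extraction pattern that will extract the whole text matched into the given field.
Parameters:
field - target field to store the matching pattern
pattern - the pattern

public void addMetadataExtractionPattern(String field, String pattern, int valueGroup)
Adds a metadata extraction pattern, which will extract the value from the specified group index upon matching.
Parameters:
field - target field to store the matching pattern
pattern - the pattern
valueGroup - which pattern group to return

public void addMetadataExtractionPatterns(RegexFieldValueExtractor... patterns)
Adds metadata extraction patterns that will extract matching field names/values.
Parameters:
patterns - extraction patterns

public void setMetadataExtractionPatterns(RegexFieldValueExtractor... patterns)
Sets metadata extraction patterns.
Parameters:
patterns - extraction patterns

public Map<String,String> getEnvironmentVariables()
Gets environment variables.
Returns:
environment variables, or null if using the current process environment variables

public void setEnvironmentVariables(Map<String,String> environmentVariables)
Sets the environment variables. Set to null to use the current process environment variables (default).
Parameters:
environmentVariables - environment variables

public void addEnvironmentVariables(Map<String,String> environmentVariables)
Adds the environment variables, keeping environment variables previously assigned. To use the current process environment variables instead, pass null to setEnvironmentVariables(Map).
Parameters:
environmentVariables - environment variables

public void addEnvironmentVariable(String name, String value)
Adds an environment variable to the list of previously assigned variables (if any). A null name has no effect while null values are converted to empty strings.
Parameters:
name - environment variable name
value - environment variable value

public String getMetadataInputFormat()
Gets the format of the metadata input file sent to the external application. Only applicable when the ${INPUT} token is part of the command.

public void setMetadataInputFormat(String metadataInputFormat)
Sets the format of the metadata input file sent to the external application. Only applicable when the ${INPUT} token is part of the command.
Parameters:
metadataInputFormat - format of the metadata input file

public String getMetadataOutputFormat()
Gets the format of the metadata output file from the external application. Only applicable when the ${OUTPUT} token is part of the command.

public void setMetadataOutputFormat(String metadataOutputFormat)
Sets the format of the metadata output file from the external application. Set to null to rely on metadata extraction patterns instead. Only applicable when the ${OUTPUT} token is part of the command.
Parameters:
metadataOutputFormat - format of the metadata output file

public PropertySetter getOnSet()
Gets the property setter to use when a metadata value is set.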
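The configuration lists four onSet values (append, prepend, replace, optional) without defining them on this page. The sketch below shows the behavior those names suggest, applied to a multi-valued field map. This reading is an assumption, and OnSetSketch, Strategy, and set are hypothetical names rather than the PropertySetter API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OnSetSketch {

    // Assumed meaning of the four documented onSet values.
    enum Strategy { APPEND, PREPEND, REPLACE, OPTIONAL }

    static void set(Map<String, List<String>> meta, Strategy strategy,
            String field, String value) {
        List<String> values =
                meta.computeIfAbsent(field, k -> new ArrayList<>());
        switch (strategy) {
            case APPEND:   // add after existing values
                values.add(value);
                break;
            case PREPEND:  // add before existing values
                values.add(0, value);
                break;
            case REPLACE:  // discard existing values first
                values.clear();
                values.add(value);
                break;
            case OPTIONAL: // only set when the field has no value yet
                if (values.isEmpty()) {
                    values.add(value);
                }
                break;
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> meta = new LinkedHashMap<>();
        set(meta, Strategy.APPEND, "tag", "b");
        set(meta, Strategy.PREPEND, "tag", "a");
        set(meta, Strategy.OPTIONAL, "tag", "ignored");
        System.out.println(meta); // prints {tag=[a, b]}
        set(meta, Strategy.REPLACE, "tag", "c");
        System.out.println(meta); // prints {tag=[c]}
    }
}
```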
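
public void setOnSet(PropertySetter onSet)
Sets the property setter to use when a metadata value is set.
Parameters:
onSet - property setter

public Path getTempDir()
Gets the directory where temporary files used for transformation are stored.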

public void setTempDir(Path tempDir)
Sets the directory where temporary files used for transformation are stored.
Parameters:
tempDir - temporary directory

public void tagApplicableDocument(HandlerDoc doc, InputStream document, ParseState parseState) throws ImporterHandlerException
Specified by:
tagApplicableDocument in class AbstractDocumentTagger
Throws:
ImporterHandlerException
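
protected void loadHandlerFromXML(XML xml)
Description copied from class: AbstractImporterHandler
Loads configuration settings specific to the implementing class.
Specified by:
loadHandlerFromXML in class AbstractImporterHandler
Parameters:
xml - XML configuration

protected void saveHandlerToXML(XML xml)
Description copied from class: AbstractImporterHandler
Saves configuration settings specific to the implementing class.
Specified by:
saveHandlerToXML in class AbstractImporterHandler
Parameters:
xml - the XML

public boolean equals(Object other)
Overrides:
equals in class AbstractImporterHandler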

public int hashCode()
Overrides:
hashCode in class AbstractImporterHandler

public String toString()
Overrides:
toString in class AbstractImporterHandler
Copyright © 2009–2023 Norconex Inc. All rights reserved.