Class ExternalParser
- All Implemented Interfaces:
IXMLConfigurable, IDocumentParser
Parses and extracts text from a file using an external application to do so.
This class relies on ExternalHandler for most of the work.
Refer to ExternalHandler for full documentation.
This parser can be configured via XML. See
GenericDocumentParserFactory for general guidance on how
to configure parsers.
To use an external application to change a file's content after parsing has
already occurred, consider using ExternalTransformer instead.
XML configuration usage:
<parser
    contentType="(content type this parser is associated to)"
    class="com.norconex.importer.parser.impl.ExternalParser">
  <command>
    c:\Apps\myapp.exe ${INPUT} ${OUTPUT} ${INPUT_META} ${OUTPUT_META}
    ${REFERENCE}
  </command>
  <metadata
      inputFormat="[json|xml|properties]"
      outputFormat="[json|xml|properties]">
    <!-- pattern only used when no output format is specified -->
    <pattern>(regular expression)</pattern>
    <!-- repeat pattern tag as needed -->
  </metadata>
  <environment>
    <variable
        name="(environment variable name)">
      (environment variable value)
    </variable>
    <!-- repeat variable tag as needed -->
  </environment>
</parser>
XML usage example:
<parser
    contentType="text/plain"
    class="com.norconex.importer.parser.impl.ExternalParser">
  <command>/path/transform/app ${INPUT} ${OUTPUT}</command>
  <metadata>
    <pattern
        field="docnumber"
        valueGroup="1">
      DocNo:(\d+)
    </pattern>
  </metadata>
</parser>
The above example invokes an external application that processes simple text files and accepts two files as arguments: the first is the file to transform, and the second holds the transformation result. It also extracts a document number from STDOUT, found as "DocNo:1234", and stores it as "docnumber".
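The token substitution used in the command can be pictured with a small sketch. This is an illustration only, not the ExternalHandler implementation: the class name CommandTokenSketch, the expand method, and the file paths are hypothetical, and the sketch assumes each token is simply replaced by a file path before the command is run.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of the Norconex API.
final class CommandTokenSketch {

    // Replaces every ${NAME} token in the template with the mapped value.
    // Tokens without a mapping are left untouched.
    static String expand(String template, Map<String, String> tokens) {
        String result = template;
        for (Map.Entry<String, String> e : tokens.entrySet()) {
            result = result.replace("${" + e.getKey() + "}", e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> tokens = new LinkedHashMap<>();
        tokens.put("INPUT", "/tmp/in.txt");   // assumed temporary input file
        tokens.put("OUTPUT", "/tmp/out.txt"); // assumed temporary output file
        System.out.println(
                expand("/path/transform/app ${INPUT} ${OUTPUT}", tokens));
    }
}
```

Because each token includes its closing brace, replacing `${INPUT}` does not affect `${INPUT_META}`.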
- Since:
- 2.2.0
- Author:
- Pascal Essiembre
Constructor Summary
Constructors
- ExternalParser()
Method Summary
Modifier and Type, Method: Description
- void addEnvironmentVariable(String name, String value): Adds an environment variable to the list of previously assigned variables (if any).
- void addEnvironmentVariables(Map<String, String> environmentVariables): Adds the environment variables, keeping environment variables previously assigned.
- void addMetadataExtractionPattern(String field, String pattern): Adds a metadata extraction pattern that will extract the whole text matched into the given field.
- void addMetadataExtractionPattern(String field, String pattern, int valueGroup): Adds a metadata extraction pattern, which will extract the value from the specified group index upon matching.
- void addMetadataExtractionPatterns(RegexFieldValueExtractor... patterns): Adds metadata extraction patterns that will extract matching field names/values.
- boolean equals(Object)
- getCommand(): Gets the command to execute.
- getEnvironmentVariables(): Gets environment variables.
- getMetadataExtractionPatterns(): Gets metadata extraction patterns.
- getMetadataInputFormat(): Gets the format of the metadata input file sent to the external application.
- getMetadataOutputFormat(): Gets the format of the metadata output file from the external application.
- getOnSet(): Gets the property setter to use when a metadata value is set.
- getTempDir(): Gets directory where to store temporary files used for transformation.
- int hashCode()
- void loadFromXML(XML xml)
- parseDocument(Doc doc, Writer output): Parses a document.
- void saveToXML(XML)
- void setCommand(String command): Sets the command to execute.
- void setEnvironmentVariables(Map<String, String> environmentVariables): Sets the environment variables.
- void setMetadataExtractionPatterns(RegexFieldValueExtractor... patterns): Sets metadata extraction patterns.
- void setMetadataInputFormat(String metadataInputFormat): Sets the format of the metadata input file sent to the external application.
- void setMetadataOutputFormat(String metadataOutputFormat): Sets the format of the metadata output file from the external application.
- void setOnSet(PropertySetter onSet): Sets the property setter to use when a metadata value is set.
- void setTempDir(Path tempDir): Sets directory where to store temporary files used for transformation.
- toString()
-
Constructor Details
-
ExternalParser
public ExternalParser()
-
-
Method Details
-
getCommand
Gets the command to execute.
- Returns:
- the command
-
setCommand
Sets the command to execute. Make sure to escape spaces in the executable path and its arguments, as well as other special command-line characters.
- Parameters:
command - the command
-
getTempDir
Gets directory where to store temporary files used for transformation.
- Returns:
- temporary directory
- Since:
- 2.8.0
-
setTempDir
Sets directory where to store temporary files used for transformation.
- Parameters:
tempDir - temporary directory
- Since:
- 2.8.0
-
getMetadataExtractionPatterns
Gets metadata extraction patterns. See class documentation.
- Returns:
- map of patterns and field names
-
addMetadataExtractionPattern
Adds a metadata extraction pattern that will extract the whole text matched into the given field.
- Parameters:
field - target field to store the matching pattern.
pattern - the pattern
- Since:
- 2.8.0
-
addMetadataExtractionPattern
Adds a metadata extraction pattern, which will extract the value from the specified group index upon matching.
- Parameters:
field - target field to store the matching pattern.
pattern - the pattern
valueGroup - which pattern group to return.
- Since:
- 2.8.0
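The difference between whole-match extraction and value-group extraction can be shown with plain java.util.regex. This is a minimal sketch, not the library's extraction code: the class PatternGroupSketch and its extract method are hypothetical, and the sketch looks at the first match only, which may differ from how the parser handles repeated matches.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the Norconex API.
final class PatternGroupSketch {

    // Returns the requested group of the first match, if any.
    // Group 0 is the whole matched text; group 1 is the first
    // parenthesized group, and so on.
    static Optional<String> extract(String text, String regex, int valueGroup) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.ofNullable(m.group(valueGroup));
    }

    public static void main(String[] args) {
        String stdout = "Processing done. DocNo:1234";
        // Whole match, comparable to the two-argument method.
        System.out.println(extract(stdout, "DocNo:(\\d+)", 0));
        // Value group 1, comparable to valueGroup="1" in the XML example.
        System.out.println(extract(stdout, "DocNo:(\\d+)", 1));
    }
}
```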
-
addMetadataExtractionPatterns
Adds metadata extraction patterns that will extract matching field names/values.
- Parameters:
patterns - extraction patterns
- Since:
- 2.8.0
-
setMetadataExtractionPatterns
Sets metadata extraction patterns. Clears any previously assigned patterns.
- Parameters:
patterns - extraction patterns
- Since:
- 2.8.0
-
getOnSet
Gets the property setter to use when a metadata value is set.
- Returns:
- property setter
- Since:
- 3.0.0
-
setOnSet
Sets the property setter to use when a metadata value is set.
- Parameters:
onSet - property setter
- Since:
- 3.0.0
-
getEnvironmentVariables
Gets environment variables.
- Returns:
- environment variables or null if using the current process environment variables
-
setEnvironmentVariables
Sets the environment variables, clearing any previously assigned environment variables. Set to null to use the current process environment variables (default).
- Parameters:
environmentVariables - environment variables
-
addEnvironmentVariables
Adds the environment variables, keeping environment variables previously assigned. Existing variables of the same name will be overwritten. To clear all previously assigned variables and use the current process environment variables, pass null to setEnvironmentVariables(Map).
- Parameters:
environmentVariables - environment variables
-
addEnvironmentVariable
Adds an environment variable to the list of previously assigned variables (if any). Existing variables of the same name will be overwritten. Setting a variable with a null name has no effect, while null values are converted to empty strings.
- Parameters:
name - environment variable name
value - environment variable value
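The set/add semantics described for environment variables can be summarized in a short sketch. It mirrors only the documented behavior (a null map means the current process environment is used, null names are ignored, null values become empty strings, same-name entries are overwritten). The class EnvVarsSketch and its methods are hypothetical stand-ins, not the parser's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not part of the Norconex API.
final class EnvVarsSketch {

    // null means: use the current process environment variables.
    private Map<String, String> vars;

    // Replaces all previously assigned variables; null resets to default.
    void set(Map<String, String> newVars) {
        vars = (newVars == null) ? null : new HashMap<>(newVars);
    }

    // Adds one variable, overwriting any existing one of the same name.
    void add(String name, String value) {
        if (name == null) {
            return; // a null name has no effect
        }
        if (vars == null) {
            vars = new HashMap<>();
        }
        vars.put(name, value == null ? "" : value); // null becomes ""
    }

    Map<String, String> get() {
        return vars;
    }
}
```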
-
getMetadataInputFormat
Gets the format of the metadata input file sent to the external application. One of "json" (default), "xml", or "properties" is expected. Only applicable when the ${INPUT} token is part of the command.
- Returns:
- metadata input format
- Since:
- 2.8.0
-
setMetadataInputFormat
Sets the format of the metadata input file sent to the external application. One of "json" (default), "xml", or "properties" is expected. Only applicable when the ${INPUT} token is part of the command.
- Parameters:
metadataInputFormat - format of the metadata input file
- Since:
- 2.8.0
-
getMetadataOutputFormat
Gets the format of the metadata output file from the external application. By default no format is set, and metadata extraction patterns are used to extract metadata information. One of "json", "xml", or "properties" is expected. Only applicable when the ${OUTPUT} token is part of the command.
- Returns:
- metadata output format
- Since:
- 2.8.0
-
setMetadataOutputFormat
Sets the format of the metadata output file from the external application. One of "json" (default), "xml", or "properties" is expected. Set to null to rely on metadata extraction patterns instead. Only applicable when the ${OUTPUT} token is part of the command.
- Parameters:
metadataOutputFormat - format of the metadata output file
- Since:
- 2.8.0
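As a rough picture of what a "properties" metadata file might look like to a consumer, the sketch below reads key/value pairs with java.util.Properties. The exact layout the parser expects is not described here, so the standard Properties syntax is an assumption, and the class MetadataPropertiesSketch and the sample keys are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration; not part of the Norconex API.
final class MetadataPropertiesSketch {

    // Parses key=value lines, assuming standard java.util.Properties syntax.
    static Properties parse(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Sample content an external application could write (assumed).
        Properties meta = parse("docnumber = 1234\ntitle = Sample Document\n");
        System.out.println(meta.getProperty("docnumber"));
    }
}
```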
-
parseDocument
Description copied from interface: IDocumentParser
Parses a document.
- Specified by:
parseDocument in interface IDocumentParser
- Parameters:
doc - importer document to parse
output - where to store extracted or modified content of the supplied document
- Returns:
- a list of first-level embedded documents, if any
- Throws:
DocumentParserException- problem parsing document
-
loadFromXML
- Specified by:
loadFromXML in interface IXMLConfigurable
-
saveToXML
- Specified by:
saveToXML in interface IXMLConfigurable
-
equals
-
hashCode
public int hashCode()
-
toString
-