Class ExternalParser

java.lang.Object
com.norconex.importer.parser.impl.ExternalParser
All Implemented Interfaces:
IXMLConfigurable, IDocumentParser

public class ExternalParser extends Object implements IDocumentParser, IXMLConfigurable

Parses and extracts text from a file using an external application.

This class relies on ExternalHandler for most of the work. Refer to ExternalHandler for full documentation.

This parser can be configured via XML. See GenericDocumentParserFactory for general guidance on configuring parsers.

To use an external application to change a file's content after parsing has already occurred, consider using ExternalTransformer instead.

XML configuration usage:


<parser
    contentType="(content type this parser is associated to)"
    class="com.norconex.importer.parser.impl.ExternalParser">
  <command>
    c:\Apps\myapp.exe ${INPUT} ${OUTPUT} ${INPUT_META} ${OUTPUT_META}
    ${REFERENCE}
  </command>
  <metadata
      inputFormat="[json|xml|properties]"
      outputFormat="[json|xml|properties]">
    <!-- pattern only used when no output format is specified -->
    <pattern>(regular expression)</pattern>
    <!-- repeat pattern tag as needed -->
  </metadata>
  <environment>
    <variable
        name="(environment variable name)">
      (environment variable value)
    </variable>
    <!-- repeat variable tag as needed -->
  </environment>
</parser>
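
The ${...} tokens in the command act as placeholders; refer to ExternalHandler for their exact meaning. As a rough illustration only, the sketch below shows plain token substitution on a command template. The class CommandTokenSketch, the helper expand, and the file paths are hypothetical and not part of the Norconex API; the actual substitution logic may differ.

```java
import java.util.Map;

class CommandTokenSketch {

    // Hypothetical helper: replaces each ${NAME} token found in the
    // template with the matching value from the supplied map.
    public static String expand(String template, Map<String, String> tokens) {
        String result = template;
        for (Map.Entry<String, String> e : tokens.entrySet()) {
            // String.replace performs a literal (non-regex) replacement.
            result = result.replace("${" + e.getKey() + "}", e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed temporary file paths, for illustration only.
        String command = expand(
                "/path/transform/app ${INPUT} ${OUTPUT}",
                Map.of("INPUT", "/tmp/in.txt", "OUTPUT", "/tmp/out.txt"));
        System.out.println(command);
    }
}
```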

XML usage example:


<parser
    contentType="text/plain"
    class="com.norconex.importer.parser.impl.ExternalParser">
  <command>/path/transform/app ${INPUT} ${OUTPUT}</command>
  <metadata>
    <pattern
        field="docnumber"
        valueGroup="1">
      DocNo:(\d+)
    </pattern>
  </metadata>
</parser>

The above example invokes an external application that processes plain text files and accepts two files as arguments: the first is the file to transform, the second holds the transformation result. It also extracts a document number from STDOUT, found as "DocNo:1234", and stores it as "docnumber".
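
To make the pattern and valueGroup settings in the example concrete, here is a minimal sketch using only java.util.regex. It is not the library's implementation; the class DocNumberSketch, the helper extract, and the sample STDOUT line are assumptions made for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DocNumberSketch {

    // Hypothetical helper: returns the requested group of the first match,
    // or an empty Optional when the pattern does not match.
    public static Optional<String> extract(
            String stdout, String regex, int valueGroup) {
        Matcher m = Pattern.compile(regex).matcher(stdout);
        return m.find() ? Optional.of(m.group(valueGroup)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed output line from the external application.
        String stdout = "Processing complete. DocNo:1234";
        // Group 1 is the part captured by (\d+).
        System.out.println(extract(stdout, "DocNo:(\\d+)", 1)); // Optional[1234]
    }
}
```

Group 0 would return the whole matched text ("DocNo:1234"), while group 1 returns only the digits that would be stored as "docnumber".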

Since:
2.2.0
Author:
Pascal Essiembre
  • Constructor Details

    • ExternalParser

      public ExternalParser()
  • Method Details

    • getCommand

      public String getCommand()
      Gets the command to execute.
      Returns:
      the command
    • setCommand

      public void setCommand(String command)
      Sets the command to execute. Make sure to escape spaces in the executable path and its arguments, as well as any other special command-line characters.
      Parameters:
      command - the command
    • getTempDir

      public Path getTempDir()
      Gets the directory in which to store temporary files used for transformation.
      Returns:
      temporary directory
      Since:
      2.8.0
    • setTempDir

      public void setTempDir(Path tempDir)
      Sets the directory in which to store temporary files used for transformation.
      Parameters:
      tempDir - temporary directory
      Since:
      2.8.0
    • getMetadataExtractionPatterns

      public List<RegexFieldValueExtractor> getMetadataExtractionPatterns()
      Gets metadata extraction patterns. See class documentation.
      Returns:
      list of metadata extraction patterns
    • addMetadataExtractionPattern

      public void addMetadataExtractionPattern(String field, String pattern)
      Adds a metadata extraction pattern that will extract the whole text matched into the given field.
      Parameters:
      field - target field in which to store the matched text
      pattern - the pattern
      Since:
      2.8.0
    • addMetadataExtractionPattern

      public void addMetadataExtractionPattern(String field, String pattern, int valueGroup)
      Adds a metadata extraction pattern, which will extract the value from the specified group index upon matching.
      Parameters:
      field - target field in which to store the matched value
      pattern - the pattern
      valueGroup - which pattern group to return.
      Since:
      2.8.0
    • addMetadataExtractionPatterns

      public void addMetadataExtractionPatterns(RegexFieldValueExtractor... patterns)
      Adds one or more metadata extraction patterns that will extract matching field names/values.
      Parameters:
      patterns - extraction patterns
      Since:
      2.8.0
    • setMetadataExtractionPatterns

      public void setMetadataExtractionPatterns(RegexFieldValueExtractor... patterns)
      Sets metadata extraction patterns. Clears any previously assigned patterns.
      Parameters:
      patterns - extraction patterns
      Since:
      2.8.0
    • getOnSet

      public PropertySetter getOnSet()
      Gets the property setter to use when a metadata value is set.
      Returns:
      property setter
      Since:
      3.0.0
    • setOnSet

      public void setOnSet(PropertySetter onSet)
      Sets the property setter to use when a metadata value is set.
      Parameters:
      onSet - property setter
      Since:
      3.0.0
    • getEnvironmentVariables

      public Map<String,String> getEnvironmentVariables()
      Gets environment variables.
      Returns:
      environment variables or null if using the current process environment variables
    • setEnvironmentVariables

      public void setEnvironmentVariables(Map<String,String> environmentVariables)
      Sets the environment variables, clearing any previously assigned environment variables. Set to null to use the current process environment variables (default).
      Parameters:
      environmentVariables - environment variables
    • addEnvironmentVariables

      public void addEnvironmentVariables(Map<String,String> environmentVariables)
      Adds the environment variables, keeping environment variables previously assigned. Existing variables of the same name will be overwritten. To clear all previously assigned variables and use the current process environment variables, pass null to setEnvironmentVariables(Map).
      Parameters:
      environmentVariables - environment variables
    • addEnvironmentVariable

      public void addEnvironmentVariable(String name, String value)
      Adds an environment variable to the list of previously assigned variables (if any). An existing variable of the same name will be overwritten. Setting a variable with a null name has no effect, while null values are converted to empty strings.
      Parameters:
      name - environment variable name
      value - environment variable value
    • getMetadataInputFormat

      public String getMetadataInputFormat()
      Gets the format of the metadata input file sent to the external application. One of "json" (default), "xml", or "properties" is expected. Only applicable when the ${INPUT_META} token is part of the command.
      Returns:
      metadata input format
      Since:
      2.8.0
    • setMetadataInputFormat

      public void setMetadataInputFormat(String metadataInputFormat)
      Sets the format of the metadata input file sent to the external application. One of "json" (default), "xml", or "properties" is expected. Only applicable when the ${INPUT_META} token is part of the command.
      Parameters:
      metadataInputFormat - format of the metadata input file
      Since:
      2.8.0
    • getMetadataOutputFormat

      public String getMetadataOutputFormat()
      Gets the format of the metadata output file from the external application. By default no format is set, and metadata extraction patterns are used to extract metadata information. One of "json", "xml", or "properties" is expected. Only applicable when the ${OUTPUT_META} token is part of the command.
      Returns:
      metadata output format
      Since:
      2.8.0
    • setMetadataOutputFormat

      public void setMetadataOutputFormat(String metadataOutputFormat)
      Sets the format of the metadata output file from the external application. One of "json", "xml", or "properties" is expected. Set to null (default) to rely on metadata extraction patterns instead. Only applicable when the ${OUTPUT_META} token is part of the command.
      Parameters:
      metadataOutputFormat - format of the metadata output file
      Since:
      2.8.0
    • parseDocument

      public List<Doc> parseDocument(Doc doc, Writer output) throws DocumentParserException
      Description copied from interface: IDocumentParser
      Parses a document.
      Specified by:
      parseDocument in interface IDocumentParser
      Parameters:
      doc - importer document to parse
      output - where to store extracted or modified content of the supplied document
      Returns:
      a list of first-level embedded documents, if any
      Throws:
      DocumentParserException - problem parsing document
    • loadFromXML

      public void loadFromXML(XML xml)
      Specified by:
      loadFromXML in interface IXMLConfigurable
    • saveToXML

      public void saveToXML(XML xml)
      Specified by:
      saveToXML in interface IXMLConfigurable
    • equals

      public boolean equals(Object other)
      Overrides:
      equals in class Object
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class Object
    • toString

      public String toString()
      Overrides:
      toString in class Object
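
The null-handling rules described for setEnvironmentVariables and addEnvironmentVariable can be sketched as follows. This is a stand-alone illustration of the documented behavior, not the library's code; the class EnvVarSketch and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class EnvVarSketch {

    // null means "use the current process environment variables".
    private Map<String, String> env;

    // Replaces all previously assigned variables; null resets to the default.
    public void set(Map<String, String> vars) {
        env = (vars == null) ? null : new HashMap<>(vars);
    }

    // Adds or overwrites a single variable.
    public void add(String name, String value) {
        if (name == null) {
            return; // a null name has no effect
        }
        if (env == null) {
            env = new HashMap<>();
        }
        env.put(name, value == null ? "" : value); // null value becomes ""
    }

    public Map<String, String> get() {
        return env;
    }

    public static void main(String[] args) {
        EnvVarSketch sketch = new EnvVarSketch();
        sketch.add("APP_HOME", null);   // stored as an empty string
        sketch.add("APP_HOME", "/opt"); // overwrites the previous value
        System.out.println(sketch.get());
    }
}
```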