Class ExternalTransformer

  • All Implemented Interfaces:
    IXMLConfigurable, IImporterHandler, IDocumentTransformer

    public class ExternalTransformer
    extends AbstractDocumentTransformer

    Transforms a document using an external application.

    This class relies on ExternalHandler for most of the work. Refer to ExternalHandler for full documentation.

    To parse/extract raw text from files, it is recommended to use ExternalParser instead.

    XML configuration usage:

    
    <handler
        class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
      <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
      <restrictTo>
        <fieldMatcher>(field-matching expression)</fieldMatcher>
        <valueMatcher>(value-matching expression)</valueMatcher>
      </restrictTo>
      <command>
        c:\Apps\myapp.exe ${INPUT} ${OUTPUT} ${INPUT_META} ${OUTPUT_META}
        ${REFERENCE}
      </command>
      <metadata
          inputFormat="[json|xml|properties]"
          outputFormat="[json|xml|properties]">
        <!--
          Pattern only used when no output format is specified.
          Repeat as needed.
          -->
        <pattern>(regular expression)</pattern>
      </metadata>
      <environment>
        <!-- repeat variable tag as needed -->
        <variable
            name="(environment variable name)">
          (environment variable value)
        </variable>
      </environment>
      <tempDir>
        (Optional directory where to store temporary files used
         for transformation.)
      </tempDir>
    </handler>

    XML usage example:

    
    <handler
        class="ExternalTransformer">
      <command>/path/transform/app ${INPUT} ${OUTPUT}</command>
      <metadata>
        <pattern
            field="docnumber"
            valueGroup="1">
          DocNo:(\d+)
        </pattern>
      </metadata>
    </handler>

    The above example invokes an external application that accepts two files as arguments: the first is the file to transform, and the second holds the transformation result. It also extracts a document number from STDOUT, found as "DocNo:1234", and stores it in the "docnumber" field.
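    To make the pattern part of the example concrete, the sketch below applies the same regular expression to some sample STDOUT text using only java.util.regex. It is an illustration, not the importer's implementation: the class StdoutPatternSketch, its extract method, and the sample output are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration; not part of the Norconex Importer API. */
final class StdoutPatternSketch {

    // Same expression as the <pattern> element in the XML example.
    private static final Pattern DOC_NO = Pattern.compile("DocNo:(\\d+)");

    /** Stores capture group 1 of every match under the "docnumber" field. */
    static Map<String, List<String>> extract(String stdout) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        Matcher m = DOC_NO.matcher(stdout);
        while (m.find()) {
            // valueGroup="1" selects the digits only, not the "DocNo:" prefix.
            metadata.computeIfAbsent("docnumber", k -> new ArrayList<>())
                    .add(m.group(1));
        }
        return metadata;
    }

    public static void main(String[] args) {
        String stdout = "Converting file...\nDocNo:1234\nDone.";
        System.out.println(extract(stdout)); // prints {docnumber=[1234]}
    }
}
```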

    Since:
    2.7.0
    Author:
    Pascal Essiembre
    See Also:
    ExternalHandler
    • Constructor Detail

      • ExternalTransformer

        public ExternalTransformer()
    • Method Detail

      • getCommand

        public String getCommand()
        Gets the command to execute.
        Returns:
        the command
      • setCommand

        public void setCommand​(String command)
        Sets the command to execute. Make sure to escape spaces in the executable path and its arguments, as well as any other special command-line characters.
        Parameters:
        command - the command
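        The sketch below shows one way to picture how the ${...} tokens in a command string are replaced with file paths, and why an executable path containing spaces needs quoting or escaping. It is an assumption-laden illustration: CommandTokenSketch, expand, and the paths are hypothetical, the actual substitution is performed by ExternalHandler, and quoting rules differ between operating systems and shells.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration; not the ExternalHandler implementation. */
final class CommandTokenSketch {

    /** Replaces each ${NAME} token with its value; unknown tokens are left as-is. */
    static String expand(String command, Map<String, String> tokenValues) {
        String result = command;
        for (Map.Entry<String, String> e : tokenValues.entrySet()) {
            result = result.replace("${" + e.getKey() + "}", e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> tokens = new LinkedHashMap<>();
        tokens.put("INPUT", "/tmp/in.bin");   // hypothetical temporary files
        tokens.put("OUTPUT", "/tmp/out.bin");

        // The executable path contains a space, so it is quoted here.
        String command = "\"/opt/my tools/app\" ${INPUT} ${OUTPUT}";
        System.out.println(expand(command, tokens));
        // prints "/opt/my tools/app" /tmp/in.bin /tmp/out.bin
    }
}
```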
      • getMetadataExtractionPatterns

        public List<RegexFieldValueExtractor> getMetadataExtractionPatterns()
        Gets metadata extraction patterns. See class documentation.
        Returns:
        list of metadata extraction patterns
      • addMetadataExtractionPattern

        public void addMetadataExtractionPattern​(String field,
                                                 String pattern)
        Adds a metadata extraction pattern that will extract the whole text matched into the given field.
        Parameters:
        field - target field in which to store the matched text
        pattern - the pattern
      • addMetadataExtractionPattern

        public void addMetadataExtractionPattern​(String field,
                                                 String pattern,
                                                 int valueGroup)
        Adds a metadata extraction pattern, which will extract the value from the specified group index upon matching.
        Parameters:
        field - target field in which to store the value of the matched group
        pattern - the pattern
        valueGroup - which pattern group to return.
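        The difference between extracting the whole matched text and extracting a specific group can be seen with plain java.util.regex, where group 0 is the whole match. The sketch below is illustrative only: ValueGroupSketch and firstMatch are hypothetical names, and treating the two-argument overload as equivalent to group 0 is an assumption based on its description.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration; not part of the Norconex Importer API. */
final class ValueGroupSketch {

    /** Returns the requested group of the first match, or null if none. */
    static String firstMatch(String regex, int valueGroup, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(valueGroup) : null;
    }

    public static void main(String[] args) {
        String output = "Status OK DocNo:1234";
        // Group 0: the whole text matched by the pattern.
        System.out.println(firstMatch("DocNo:(\\d+)", 0, output)); // prints DocNo:1234
        // Group 1: only the first parenthesized group.
        System.out.println(firstMatch("DocNo:(\\d+)", 1, output)); // prints 1234
    }
}
```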
      • addMetadataExtractionPatterns

        public void addMetadataExtractionPatterns​(RegexFieldValueExtractor... patterns)
        Adds metadata extraction patterns that will extract matching field names/values.
        Parameters:
        patterns - extraction patterns
      • setMetadataExtractionPatterns

        public void setMetadataExtractionPatterns​(RegexFieldValueExtractor... patterns)
        Sets metadata extraction patterns. Clears any previously assigned patterns.
        Parameters:
        patterns - extraction patterns
      • getEnvironmentVariables

        public Map<String,​String> getEnvironmentVariables()
        Gets environment variables.
        Returns:
        environment variables or null if using the current process environment variables
      • setEnvironmentVariables

        public void setEnvironmentVariables​(Map<String,​String> environmentVariables)
        Sets the environment variables, clearing any previously assigned environment variables. Set to null to use the current process environment variables (default).
        Parameters:
        environmentVariables - environment variables
      • addEnvironmentVariables

        public void addEnvironmentVariables​(Map<String,​String> environmentVariables)
        Adds the environment variables, keeping environment variables previously assigned. Existing variables of the same name will be overwritten. To clear all previously assigned variables and use the current process environment variables, pass null to setEnvironmentVariables(Map).
        Parameters:
        environmentVariables - environment variables
      • addEnvironmentVariable

        public void addEnvironmentVariable​(String name,
                                           String value)
        Adds an environment variable to the list of previously assigned variables (if any). An existing variable of the same name will be overwritten. Setting a variable with a null name has no effect, while null values are converted to empty strings.
        Parameters:
        name - environment variable name
        value - environment variable value
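        The rules described for the environment variable methods (null meaning "use the current process environment", overwriting by name, ignoring null names, and converting null values to empty strings) can be modeled with a small map wrapper. This is a sketch of the documented behavior only, not the library's code: EnvVarsSketch and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the documented semantics; not the library's code. */
final class EnvVarsSketch {

    // null means "use the current process environment variables".
    private Map<String, String> vars;

    Map<String, String> get() {
        return vars;
    }

    /** Replaces all variables; null reverts to the process environment. */
    void set(Map<String, String> newVars) {
        vars = (newVars == null) ? null : new HashMap<>(newVars);
    }

    /** Adds or overwrites one variable; a null name is ignored. */
    void add(String name, String value) {
        if (name == null) {
            return;
        }
        if (vars == null) {
            vars = new HashMap<>();
        }
        vars.put(name, value == null ? "" : value);
    }

    /** Adds or overwrites several variables, keeping the others. */
    void addAll(Map<String, String> more) {
        if (more != null) {
            more.forEach(this::add);
        }
    }

    public static void main(String[] args) {
        EnvVarsSketch env = new EnvVarsSketch();
        env.add("APP_MODE", "fast");
        env.add("APP_MODE", "safe");   // overwrites the previous value
        env.add("APP_HOME", null);     // stored as an empty string
        System.out.println(env.get().get("APP_MODE"));           // prints safe
        System.out.println(env.get().get("APP_HOME").isEmpty()); // prints true
    }
}
```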
      • getMetadataInputFormat

        public String getMetadataInputFormat()
        Gets the format of the metadata input file sent to the external application. One of "json" (default), "xml", or "properties" is expected. Only applicable when the ${INPUT_META} token is part of the command.
        Returns:
        metadata input format
      • setMetadataInputFormat

        public void setMetadataInputFormat​(String metadataInputFormat)
        Sets the format of the metadata input file sent to the external application. One of "json" (default), "xml", or "properties" is expected. Only applicable when the ${INPUT_META} token is part of the command.
        Parameters:
        metadataInputFormat - format of the metadata input file
      • getMetadataOutputFormat

        public String getMetadataOutputFormat()
        Gets the format of the metadata output file from the external application. By default no format is set, and metadata extraction patterns are used to extract metadata information. One of "json", "xml", or "properties" is expected. Only applicable when the ${OUTPUT_META} token is part of the command.
        Returns:
        metadata output format
      • setMetadataOutputFormat

        public void setMetadataOutputFormat​(String metadataOutputFormat)
        Sets the format of the metadata output file from the external application. One of "json", "xml", or "properties" is expected. Set to null (the default) to rely on metadata extraction patterns instead. Only applicable when the ${OUTPUT_META} token is part of the command.
        Parameters:
        metadataOutputFormat - format of the metadata output file
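        As a rough idea of what the "properties" format means for a metadata output file, the sketch below parses simple key=value content with java.util.Properties. It only illustrates the general file format: PropertiesMetadataSketch and the sample content are hypothetical, and the exact layout the importer expects from the external application is not specified here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical illustration; not part of the Norconex Importer API. */
final class PropertiesMetadataSketch {

    /** Parses key=value lines, as an external application might write them. */
    static Properties parse(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical content of a metadata output file.
        Properties p = parse("docnumber=1234\ntitle=Sample report\n");
        System.out.println(p.getProperty("docnumber")); // prints 1234
        System.out.println(p.getProperty("title"));     // prints Sample report
    }
}
```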
      • getOnSet

        public PropertySetter getOnSet()
        Gets the property setter to use when a metadata value is set.
        Returns:
        property setter
        Since:
        3.0.0
      • setOnSet

        public void setOnSet​(PropertySetter onSet)
        Sets the property setter to use when a metadata value is set.
        Parameters:
        onSet - property setter
        Since:
        3.0.0
      • getTempDir

        public Path getTempDir()
        Gets the directory in which to store temporary files used for transformation.
        Returns:
        temporary directory
      • setTempDir

        public void setTempDir​(Path tempDir)
        Sets the directory in which to store temporary files used for transformation.
        Parameters:
        tempDir - temporary directory