Class SplitTagger

  • All Implemented Interfaces:
    IXMLConfigurable, IImporterHandler, IDocumentTagger

    public class SplitTagger
    extends AbstractCharStreamTagger

    Splits an existing metadata value into multiple values based on a given value separator (the separator is discarded). The "toField" attribute is optional: if it is not specified, the split values are stored back in the source field. Duplicates are removed.

    Can be used as either a pre-parse (metadata or text content) or post-parse handler.

    If no "fieldMatcher" expression is specified, the document content will be used. If the "fieldMatcher" matches more than one field, they will all be split and stored in the same multi-value metadata field.
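
    The splitting and de-duplication described above can be illustrated with a minimal Java sketch. It is not the SplitTagger implementation: the SplitSketch class and its split method are hypothetical names, and keeping values in first-seen order is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Norconex Importer.
class SplitSketch {

    // Splits every source value on the separator and returns the distinct
    // pieces in first-seen order (an assumption). The separator is discarded.
    static List<String> split(
            List<String> sourceValues, String separator, boolean regex) {
        // A literal separator is quoted so regex metacharacters lose their meaning.
        Pattern pattern =
                Pattern.compile(regex ? separator : Pattern.quote(separator));
        Set<String> distinct = new LinkedHashSet<>();
        for (String value : sourceValues) {
            for (String piece : pattern.split(value)) {
                distinct.add(piece);
            }
        }
        return new ArrayList<>(distinct);
    }

    public static void main(String[] args) {
        // Values gathered from two matched fields, split on a literal "|".
        System.out.println(split(List.of("a|b|c", "c|d"), "|", false));
    }
}
```

    Here the values of two matched fields end up in one list, and the repeated "c" is kept only once.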

    Storing values in an existing field

    If a target field with the same name already exists for a document, values will be added to the end of the existing value list. It is possible to change this default behavior by supplying a PropertySetter.
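
    As a rough illustration of the default append behavior, the sketch below models document metadata as a plain map of field names to value lists. The AppendSketch class and the map-based model are assumptions for the example and do not reflect the actual metadata or PropertySetter API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for document metadata; not the Norconex API.
class AppendSketch {

    // Adds new values after any values already stored under the target field,
    // creating the field if it does not exist yet.
    static void append(
            Map<String, List<String>> metadata,
            String toField,
            List<String> newValues) {
        metadata.computeIfAbsent(toField, k -> new ArrayList<>())
                .addAll(newValues);
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new HashMap<>();
        metadata.put("tags", new ArrayList<>(List.of("existing")));
        append(metadata, "tags", List.of("red", "green"));
        System.out.println(metadata.get("tags"));
    }
}
```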

    XML configuration usage:

    
    <handler
        class="com.norconex.importer.handler.tagger.impl.SplitTagger"
        sourceCharset="(character encoding)">
      <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
      <restrictTo>
        <fieldMatcher>(field-matching expression)</fieldMatcher>
        <valueMatcher>(value-matching expression)</valueMatcher>
      </restrictTo>
      <!-- multiple split tags allowed -->
      <split
          toField="targetFieldName">
        <fieldMatcher>(one or more matching fields to split)</fieldMatcher>
        <separator
            regex="[false|true]">
          (separator value)
        </separator>
      </split>
    </handler>

    XML usage example:

    
    <handler
        class="SplitTagger">
      <split>
        <fieldMatcher>myField</fieldMatcher>
        <separator
            regex="true">
          \s*,\s*
        </separator>
      </split>
    </handler>

    The above example splits a single-value field holding a comma-separated list into multiple values.
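
    To see what the separator pattern from this example matches, the following sketch applies the same regular expression with the standard String.split method. The CommaSplitSketch class and the sample value are made up for the illustration, and the tagger itself may treat edge cases such as empty values differently.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demonstration class; only the regular expression comes
// from the XML example.
class CommaSplitSketch {

    static List<String> splitOnCommas(String value) {
        // A comma with optional whitespace on either side.
        return Arrays.asList(value.split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        System.out.println(splitOnCommas("apple , banana,cherry"));
    }
}
```

    Because the whitespace around each comma is part of the separator, it is discarded along with the comma. Whitespace at the very start or end of the whole value is not matched by this pattern.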

    Since:
    1.3.0
    Author:
    Pascal Essiembre