Class TextPatternTagger

  • All Implemented Interfaces:
    IXMLConfigurable, IImporterHandler, IDocumentTagger

    @Deprecated
    public class TextPatternTagger
    extends AbstractStringTagger
    implements IXMLConfigurable
    Deprecated.
    Since 3.0.0, use RegexTagger.

    Extracts all text values matching the provided regular expression and adds them to a field that is either provided explicitly or also matched by the regular expression. The target field is considered a multi-value field.

    It is possible to extract both field names and their values with regular expressions. This is done by using match groups (parentheses) in your regular expressions. For each pattern you define, you can specify which match group holds the field name and which one holds the value. Specifying a field match group is optional if a field is provided. If no match groups are specified, a field is expected.
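
    The match-group mechanism described above can be sketched with plain java.util.regex. This is an illustration only, not the Norconex implementation: the class GroupExtractionSketch and its extract method are hypothetical names, and the sample HTML rows are made up for the demonstration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: plain java.util.regex, not the Norconex API.
class GroupExtractionSketch {

    // Builds a multi-valued field map from every match of the regex.
    // fieldGroup and valueGroup are match group indexes (0 = whole match).
    static Map<String, List<String>> extract(
            String text, String regex, int fieldGroup, int valueGroup) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            String field = m.group(fieldGroup);
            String value = m.group(valueGroup);
            // Repeated field names accumulate values (multi-value field).
            fields.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String html =
                "<tr><td class=\"label\">color</td><td class=\"value\">red</td></tr>"
              + "<tr><td class=\"label\">color</td><td class=\"value\">blue</td></tr>"
              + "<tr><td class=\"label\">size</td><td class=\"value\">XL</td></tr>";
        String regex =
                "<tr><td class=\"label\">(.*?)</td><td class=\"value\">(.*?)</td></tr>";
        // Group 1 holds the field name, group 2 holds the value.
        System.out.println(extract(html, regex, 1, 2));
        // prints {color=[red, blue], size=[XL]}
    }
}
```

    Group 1 plays the role of fieldGroup and group 2 the role of valueGroup; a pattern with a fixed target field would only need the value group.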

    Storing values in an existing field

    If a target field with the same name already exists for a document, values will be added to the end of the existing value list. It is possible to change this default behavior by supplying a PropertySetter.
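
    As a rough mental model of the four onSet options (append, prepend, replace, optional), the sketch below applies each one to a plain map of multi-valued fields. It does not use the Norconex PropertySetter class: OnSetSketch, Strategy and apply are hypothetical names, and the reading of "optional" as "set only when the field has no values yet" is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: models the onSet options on a plain Map.
class OnSetSketch {

    enum Strategy { APPEND, PREPEND, REPLACE, OPTIONAL }

    static void apply(Map<String, List<String>> fields, String field,
            List<String> newValues, Strategy strategy) {
        List<String> existing =
                fields.computeIfAbsent(field, k -> new ArrayList<>());
        switch (strategy) {
            case APPEND:
                // Default behavior: add to the end of the existing list.
                existing.addAll(newValues);
                break;
            case PREPEND:
                existing.addAll(0, newValues);
                break;
            case REPLACE:
                existing.clear();
                existing.addAll(newValues);
                break;
            case OPTIONAL:
                // Assumed meaning: only set when nothing is stored yet.
                if (existing.isEmpty()) {
                    existing.addAll(newValues);
                }
                break;
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = new HashMap<>();
        apply(fields, "emails", List.of("a@x.org"), Strategy.APPEND);
        apply(fields, "emails", List.of("b@x.org"), Strategy.APPEND);
        apply(fields, "emails", List.of("c@x.org"), Strategy.PREPEND);
        System.out.println(fields.get("emails"));
        // prints [c@x.org, a@x.org, b@x.org]
    }
}
```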

    This class can be used as a pre-parsing handler (on text documents only) or as a post-parsing handler.

    XML configuration usage:

    
    <handler
        class="com.norconex.importer.handler.tagger.impl.TextPatternTagger"
        sourceCharset="(character encoding)"
        maxReadSize="(max characters to read at once)">
      <restrictTo
          caseSensitive="[false|true]"
          field="(name of header/metadata field name to match)">
        (regular expression of value to match)
      </restrictTo>
      <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
      <pattern
          toField="(target field name)"
          fieldGroup="(field name match group index)"
          valueGroup="(field value match group index)"
          ignoreCase="[false|true]"
          ignoreDiacritic="[false|true]"
          onSet="[append|prepend|replace|optional]">
        (regular expression)
      </pattern>
      <!-- multiple pattern tags allowed -->
    </handler>

    Usage example:

    The first pattern in the following example extracts strings that look like email addresses into an "emails" field (simplified regex). The second pattern extracts field names and values from "label" and "value" cells in a given HTML table:

    XML usage example:

    
    <handler
        class="TextPatternTagger">
      <pattern
          field="emails">
        [A-Za-z0-9+_.-]+?@[a-zA-Z0-9.-]+
      </pattern>
      <pattern
          fieldGroup="1"
          valueGroup="2">
        <![CDATA[
            <tr><td class="label">(.*?)</td><td class="value">(.*?)</td></tr>
          ]]>
      </pattern>
    </handler>
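
    To see what the simplified email expression in the first pattern matches, the following sketch runs it with plain java.util.regex and collects the whole match each time. EmailPatternSketch and findAll are hypothetical names, the sample sentence is made up, and nothing here calls the Norconex API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: runs the simplified email regex from the example.
class EmailPatternSketch {

    static final String EMAIL_REGEX = "[A-Za-z0-9+_.-]+?@[a-zA-Z0-9.-]+";

    // Returns the whole matched text (group 0) of every match.
    static List<String> findAll(String text, String regex) {
        List<String> values = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            values.add(m.group());
        }
        return values;
    }

    public static void main(String[] args) {
        String text = "Contact alice@example.com or "
                + "bob.smith@mail.example.org for details.";
        System.out.println(findAll(text, EMAIL_REGEX));
        // prints [alice@example.com, bob.smith@mail.example.org]
    }
}
```

    Because the domain part of the expression accepts dots, an address immediately followed by a sentence-ending period would be captured with that period included.
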
    Since:
    2.3.0
    Author:
    Pascal Essiembre
    • Constructor Detail

      • TextPatternTagger

        public TextPatternTagger()
        Deprecated.
    • Method Detail

      • addPattern

        public void addPattern​(String field,
                               String pattern)
        Deprecated.
        Adds a pattern that will extract the whole matched text into the given field.
        Parameters:
        field - target field in which to store the matched text.
        pattern - the pattern
      • addPattern

        public void addPattern​(String field,
                               String pattern,
                               int valueGroup)
        Deprecated.
        Adds a new pattern, which will extract the value from the specified group index upon matching.
        Parameters:
        field - target field in which to store the extracted value.
        pattern - the pattern
        valueGroup - which pattern group to return.
      • addPattern

        public void addPattern​(RegexFieldValueExtractor... pattern)
        Deprecated.
        Adds one or more patterns that will extract matching field names/values.
        Parameters:
        pattern - field extractor pattern
      • setPattern

        public void setPattern​(RegexFieldValueExtractor... patterns)
        Deprecated.
        Sets one or more patterns that will extract matching field names/values. Clears previously set patterns.
        Parameters:
        patterns - field extractor pattern
      • getPatterns

        public List<RegexFieldValueExtractor> getPatterns()
        Deprecated.
        Gets the patterns used to extract matching field names/values.
        Returns:
        patterns
      • saveStringTaggerToXML

        protected void saveStringTaggerToXML​(XML xml)
        Deprecated.
        Description copied from class: AbstractStringTagger
        Saves configuration settings specific to the implementing class. The parent tag along with the "class" attribute are already written. Implementors must not close the writer.
        Specified by:
        saveStringTaggerToXML in class AbstractStringTagger
        Parameters:
        xml - the XML