Class RegexTagger

  • All Implemented Interfaces:
    IXMLConfigurable, IImporterHandler, IDocumentTagger

    public class RegexTagger
    extends AbstractStringTagger
    implements IXMLConfigurable

    Extracts field names and their values with regular expressions. This is done by using match groups (parentheses) in your regular expressions. For each pattern you define, you can specify which match group holds the field name and which one holds the value. Specifying a field match group is optional if a field is provided. If no match groups are specified, a field is expected.
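
    The difference between keeping the whole match and keeping a single match group can be illustrated with plain java.util.regex. This is a minimal sketch, not the RegexTagger implementation: the GroupSketch class and its values helper are hypothetical, and index 0 is used here only because java.util.regex defines group 0 as the whole match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the Norconex Importer API.
class GroupSketch {

    // Collects one value per match. In java.util.regex, group 0 is the
    // whole match and higher indexes are the parenthesized groups.
    static List<String> values(String text, String regex, int valueGroup) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group(valueGroup));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Invoice #1234 and Invoice #5678";
        // Whole match kept as the value.
        System.out.println(values(text, "Invoice #(\\d+)", 0));
        // Only the digits captured by group 1 kept as the value.
        System.out.println(values(text, "Invoice #(\\d+)", 1));
    }
}
```

    In both calls the target field name would be supplied separately, which corresponds to the case where no field match group is specified.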

    If "fieldMatcher" is specified, content from the matching fields is used as the source text, and all extracted text is stored in the target field as multiple values. Otherwise, the document content is used.

    Storing values in an existing field

    If a target field with the same name already exists for a document, values will be added to the end of the existing value list. It is possible to change this default behavior by supplying a PropertySetter.
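
    The default append behavior, and one way it might be overridden, can be sketched with a plain map of field names to value lists. Everything here is an assumption made for illustration: the StoreSketch class, its Mode enum and the REPLACE mode are hypothetical stand-ins and do not reproduce the PropertySetter API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the Norconex PropertySetter.
class StoreSketch {

    // Assumed modes for this sketch only.
    enum Mode { APPEND, REPLACE }

    static void store(Map<String, List<String>> doc, String field,
            List<String> extracted, Mode mode) {
        if (mode == Mode.REPLACE) {
            // Discard any existing values for the field.
            doc.put(field, new ArrayList<>(extracted));
        } else {
            // Default described above: add to the end of the existing list.
            doc.computeIfAbsent(field, k -> new ArrayList<>())
                    .addAll(extracted);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new HashMap<>();
        doc.put("emails", new ArrayList<>(List.of("old@example.com")));

        store(doc, "emails", List.of("new@example.com"), Mode.APPEND);
        System.out.println(doc.get("emails"));

        store(doc, "emails", List.of("only@example.com"), Mode.REPLACE);
        System.out.println(doc.get("emails"));
    }
}
```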

    This class can be used as a pre-parsing handler (on text documents only) or as a post-parsing handler.

    XML configuration usage:

    
    <handler
        class="com.norconex.importer.handler.tagger.impl.RegexTagger"
        maxReadSize="(max characters to read at once)"
        sourceCharset="(character encoding)">
      <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
      <restrictTo>
        <fieldMatcher>(field-matching expression)</fieldMatcher>
        <valueMatcher>(value-matching expression)</valueMatcher>
      </restrictTo>
      <fieldMatcher>
        (optional expression matching source fields on which to perform extraction)
      </fieldMatcher>
      <!-- multiple pattern tags allowed -->
      <pattern>(regular expression)</pattern>
    </handler>

    XML usage example:

    
    <handler
        class="RegexTagger">
      <pattern
          toField="emails">
        [A-Za-z0-9+_.-]+?@[a-zA-Z0-9.-]+
      </pattern>
      <pattern
          fieldGroup="1"
          valueGroup="2">
        <![CDATA[
         <tr><td class="label">(.*?)</td><td class="value">(.*?)</td></tr>
       ]]>
      </pattern>
    </handler>

    The first pattern in the above example extracts what look like email addresses into an "emails" field (simplified regex). The second pattern extracts field names and values from the "label" and "value" cells of an HTML table.
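
    To see what the two regular expressions in the example match, they can be run directly with java.util.regex. This is a sketch under stated assumptions rather than what RegexTagger does internally: the ExamplePatternsSketch class, its extract helper and the sample HTML are hypothetical, and the whitespace surrounding the patterns in the XML is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration using java.util.regex only.
class ExamplePatternsSketch {

    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9+_.-]+?@[a-zA-Z0-9.-]+");
    static final Pattern ROW = Pattern.compile(
            "<tr><td class=\"label\">(.*?)</td>"
            + "<td class=\"value\">(.*?)</td></tr>");

    static Map<String, List<String>> extract(String content) {
        Map<String, List<String>> fields = new LinkedHashMap<>();

        // First pattern: whole match stored under the fixed "emails" field.
        Matcher email = EMAIL.matcher(content);
        while (email.find()) {
            fields.computeIfAbsent("emails", k -> new ArrayList<>())
                    .add(email.group());
        }

        // Second pattern: group 1 names the field, group 2 is its value.
        Matcher row = ROW.matcher(content);
        while (row.find()) {
            fields.computeIfAbsent(row.group(1), k -> new ArrayList<>())
                    .add(row.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<p>Write to info@example.com</p><table>"
                + "<tr><td class=\"label\">author</td>"
                + "<td class=\"value\">Jane Doe</td></tr>"
                + "<tr><td class=\"label\">year</td>"
                + "<td class=\"value\">2020</td></tr></table>";
        // prints {emails=[info@example.com], author=[Jane Doe], year=[2020]}
        System.out.println(extract(html));
    }
}
```

    Note that the second regular expression only matches rows written exactly as shown, with no whitespace or extra attributes between the tags.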

    Since:
    3.0.0, based on former TextPatternTagger
    Author:
    Pascal Essiembre
    See Also:
    RegexFieldValueExtractor
    • Constructor Detail

      • RegexTagger

        public RegexTagger()
    • Method Detail

      • addPattern

        public void addPattern​(String field,
                               String pattern)
        Adds a pattern that will extract the whole text matched into the given field.
        Parameters:
        field - target field in which to store the matched text.
        pattern - the pattern
      • addPattern

        public void addPattern​(String field,
                               String pattern,
                               int valueGroup)
        Adds a new pattern, which will extract the value from the specified group index upon matching.
        Parameters:
        field - target field in which to store the matched value.
        pattern - the pattern
        valueGroup - which pattern group to return.
      • addPattern

        public void addPattern​(RegexFieldValueExtractor... pattern)
        Adds one or more patterns that will extract matching field names/values.
        Parameters:
        pattern - field extractor pattern
      • setPattern

        public void setPattern​(RegexFieldValueExtractor... patterns)
        Sets one or more patterns that will extract matching field names/values. Clears previously set patterns.
        Parameters:
        patterns - field extractor pattern
      • getPatterns

        public List<RegexFieldValueExtractor> getPatterns()
        Gets the patterns used to extract matching field names/values.
        Returns:
        patterns
      • getFieldMatcher

        public TextMatcher getFieldMatcher()
        Gets source field matcher for fields on which to extract fields/values.
        Returns:
        field matcher
      • setFieldMatcher

        public void setFieldMatcher​(TextMatcher fieldMatcher)
        Sets source field matcher for fields on which to extract fields/values.
        Parameters:
        fieldMatcher - field matcher
      • saveStringTaggerToXML

        protected void saveStringTaggerToXML​(XML xml)
        Description copied from class: AbstractStringTagger
        Saves configuration settings specific to the implementing class. The parent tag along with the "class" attribute are already written. Implementors must not close the writer.
        Specified by:
        saveStringTaggerToXML in class AbstractStringTagger
        Parameters:
        xml - the XML