Class CountMatchesTagger

  • All Implemented Interfaces:
    IXMLConfigurable, IImporterHandler, IDocumentTagger

    public class CountMatchesTagger
    extends AbstractCharStreamTagger

    Counts the number of matches of a given string (or string pattern) and stores the resulting value in the field specified by "toField".

    If no "fieldMatcher" expression is specified, the document content will be used. If the "fieldMatcher" matches more than one field, the sum of all matches will be stored as a single value. In most cases, you will want to set your "countMatcher" to "partial".
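
    The summing behavior described above can be sketched in plain Java. This is only an illustration, not the tagger's implementation: the class SummedMatchCount, its methods, and the sample field names are all hypothetical, and the matching here is a simple case-sensitive substring search.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Norconex API.
class SummedMatchCount {

    // Counts non-overlapping occurrences of a literal string in the text.
    static int countOccurrences(String needle, String text) {
        if (needle.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    // Sums the occurrences over every value of every field whose
    // name fully matches the given regular expression.
    static int sumOverFields(
            Map<String, List<String>> fields, String fieldRegex, String needle) {
        int total = 0;
        for (Map.Entry<String, List<String>> entry : fields.entrySet()) {
            if (entry.getKey().matches(fieldRegex)) {
                for (String value : entry.getValue()) {
                    total += countOccurrences(needle, value);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Sample metadata, assumed for the purpose of this sketch.
        Map<String, List<String>> fields = Map.of(
                "title", List.of("apple pie"),
                "description", List.of("apple, apple and pear"),
                "author", List.of("apple"));
        // "title" contributes 1 and "description" contributes 2;
        // "author" is not selected by the field expression.
        System.out.println(sumOverFields(fields, "title|description", "apple"));
    }
}
```

    In this sketch the two selected fields yield a single combined count, which mirrors how one value is stored in "toField" regardless of how many fields matched.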

    Storing values in an existing field

    If a target field with the same name already exists for a document, the count value will be added to the end of the existing value list. It is possible to change this default behavior with setOnSet(PropertySetter).

    When matching strings on document content, this tagger can be used as a pre-parse handler on text documents only. When the "fieldMatcher" is used, it can be used as either a pre-parse or a post-parse handler.

    XML configuration usage:

    
    <handler
        class="com.norconex.importer.handler.tagger.impl.CountMatchesTagger"
        toField="(target field)"
        maxReadSize="(max characters to read at once)"
        sourceCharset="(character encoding)">
      <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
      <restrictTo>
        <fieldMatcher>(field-matching expression)</fieldMatcher>
        <valueMatcher>(value-matching expression)</valueMatcher>
      </restrictTo>
      <fieldMatcher>
        (optional expression for fields used to count matches)
      </fieldMatcher>
      <countMatcher>(expression used to count matches)</countMatcher>
    </handler>

    XML usage example:

    
    <handler
        class="CountMatchesTagger"
        toField="urlSegmentCount">
      <fieldMatcher>document.reference</fieldMatcher>
      <countMatcher
          method="regex">
        /[^/]+
      </countMatcher>
    </handler>

    The above will count the number of segments in a URL.
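
    To see what the regular expression in this example counts, the following minimal sketch applies the same pattern with java.util.regex outside of the Importer. The class UrlSegmentCount, its countMatches method, and the sample URL are assumptions made for illustration; they are not part of the tagger.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the Norconex API.
class UrlSegmentCount {

    // Counts non-overlapping matches of the regular expression in the text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Sample document reference, assumed for the purpose of this sketch.
        String reference = "https://example.com/docs/guide/index.html";
        // Matches: "/example.com", "/docs", "/guide", "/index.html"
        System.out.println(countMatches("/[^/]+", reference)); // prints 4
    }
}
```

    Note that with this pattern the host name is counted as a segment too, since it follows the second slash of "https://".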

    Since:
    2.6.0
    Author:
    Pascal Essiembre
    See Also:
    Pattern