Class TextBetweenTagger

  • All Implemented Interfaces:
    IXMLConfigurable, IImporterHandler, IDocumentTagger

    public class TextBetweenTagger
    extends AbstractStringTagger
    implements IXMLConfigurable

    Extracts values found between matching start and end strings and adds them to a document metadata field. The start and end strings are defined in pairs, and multiple pairs can be specified at once. The field specified for a pair is treated as a multi-value field.

    If "fieldMatcher" is specified, text is extracted from the values of the matching fields, and all extracted text is stored in the target field as multiple values. Otherwise, the document content is used.

    Storing values in an existing field

    If a target field with the same name already exists for a document, values will be added to the end of the existing value list. It is possible to change this default behavior by supplying a PropertySetter.
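
    As an illustration of the default "append" behavior described above, here is a minimal sketch in plain Java. It does not use the Importer API: the class MultiValueFieldSketch and its append/replace methods are hypothetical stand-ins for document metadata, and "replace" is shown only as one example of an alternative behavior, not as a statement of what PropertySetter offers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for document metadata (not the Importer API).
class MultiValueFieldSketch {

    // Field name -> ordered list of values (a multi-value field).
    private final Map<String, List<String>> fields = new HashMap<>();

    // Default behavior described above: new values are added after
    // any values the field already holds.
    void append(String field, List<String> values) {
        fields.computeIfAbsent(field, k -> new ArrayList<>()).addAll(values);
    }

    // One possible alternative: discard existing values first.
    void replace(String field, List<String> values) {
        fields.put(field, new ArrayList<>(values));
    }

    // Returns the values of a field, or an empty list if it is not set.
    List<String> get(String field) {
        return fields.getOrDefault(field, List.of());
    }

    public static void main(String[] args) {
        MultiValueFieldSketch doc = new MultiValueFieldSketch();
        doc.append("content", List.of("existing"));
        doc.append("content", List.of("first", "second"));
        System.out.println(doc.get("content")); // [existing, first, second]
    }
}
```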

    This class can be used as a pre-parsing handler (on text documents only) or as a post-parsing handler.

    XML configuration usage:

    
    <handler
        class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger"
        maxReadSize="(max characters to read at once)"
        sourceCharset="(character encoding)">
      <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
      <restrictTo>
        <fieldMatcher>(field-matching expression)</fieldMatcher>
        <valueMatcher>(value-matching expression)</valueMatcher>
      </restrictTo>
      <!-- multiple textBetween tags allowed -->
      <textBetween
          toField="(target field name)"
          inclusive="[false|true]">
        <fieldMatcher>
          (optional expression matching fields to perform extraction on)
        </fieldMatcher>
        <startMatcher>(expression matching "left" delimiter)</startMatcher>
        <endMatcher>(expression matching "right" delimiter)</endMatcher>
      </textBetween>
    </handler>

    XML usage example:

    
    <handler
        class="TextBetweenTagger">
      <textBetween
          toField="content">
        <startMatcher>OPEN</startMatcher>
        <endMatcher>CLOSE</endMatcher>
      </textBetween>
    </handler>

    The above example extracts the content between the "OPEN" and "CLOSE" strings, excluding these strings, and stores it in a "content" field.
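
    The extraction in the example above can be approximated in plain Java as follows. This is a rough sketch, not the tagger's implementation: the class TextBetweenSketch and its extract method are hypothetical, delimiters are assumed to be literal strings, and each start delimiter is assumed to pair with the nearest following end delimiter. The actual matcher expressions may behave differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of "text between" extraction (not the Importer API).
class TextBetweenSketch {

    // Returns every span found between a start and an end delimiter.
    // When inclusive is true, the delimiters are kept in each value.
    static List<String> extract(
            String text, String start, String end, boolean inclusive) {
        // Pattern.quote treats the delimiters as literal strings; the
        // reluctant quantifier stops at the nearest end delimiter.
        Pattern pattern = Pattern.compile(
                Pattern.quote(start) + "(.*?)" + Pattern.quote(end),
                Pattern.DOTALL);
        List<String> values = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            values.add(inclusive ? matcher.group() : matcher.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String text = "a OPEN first CLOSE b OPEN second CLOSE c";
        // Delimiters excluded, as in the XML example above.
        System.out.println(extract(text, "OPEN", "CLOSE", false));
        // Delimiters kept, comparable to inclusive="true".
        System.out.println(extract(text, "OPEN", "CLOSE", true));
    }
}
```

    Each match becomes a separate value, which mirrors the multi-value nature of the target field. Surrounding whitespace is kept as-is in this sketch.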

    Author:
    Pascal Essiembre