Class TextStatisticsTagger

  • All Implemented Interfaces:
    IXMLConfigurable, IImporterHandler, IDocumentTagger

    public class TextStatisticsTagger
    extends AbstractCharStreamTagger
    implements IXMLConfigurable

    Analyzes the content of the supplied document and adds statistical information about that content, or about selected fields, as metadata fields. By default, statistics are computed on the document content. The table below lists the metadata fields created, along with their descriptions.

    Statistic fields
    Field name                                    Description
    document.stat.characterCount                  Total number of characters (excluding carriage returns and line feeds).
    document.stat.wordCount                       Total number of words.
    document.stat.sentenceCount                   Total number of sentences.
    document.stat.paragraphCount                  Total number of paragraphs.
    document.stat.averageWordCharacterCount       Average number of characters per word.
    document.stat.averageSentenceCharacterCount   Average number of characters per sentence (including non-word characters, such as spaces or slashes).
    document.stat.averageSentenceWordCount        Average number of words per sentence.
    document.stat.averageParagraphCharacterCount  Average number of characters per paragraph (including non-word characters, such as spaces or slashes).
    document.stat.averageParagraphSentenceCount   Average number of sentences per paragraph.
    document.stat.averageParagraphWordCount       Average number of words per paragraph.
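
    To make the meaning of these fields more concrete, here is a minimal sketch of how a few of them could be derived from a piece of text. It is not the tagger's actual implementation: the class TextStatsSketch and its methods are hypothetical, word and sentence boundaries are assumed to come from java.text.BreakIterator, and paragraphs are assumed to be non-blank runs of text separated by line breaks. The real tagger may segment text differently.

```java
import java.text.BreakIterator;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Norconex Importer API.
public class TextStatsSketch {

    // Counts the segments reported by the iterator that contain at least
    // one letter or digit (whitespace and punctuation-only segments are skipped).
    static int countSegments(BreakIterator it, String text) {
        it.setText(text);
        int count = 0;
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            boolean hasWordChar = text.substring(start, end)
                    .chars().anyMatch(Character::isLetterOrDigit);
            if (hasWordChar) {
                count++;
            }
        }
        return count;
    }

    // Computes a subset of the statistics, keyed by the field names above.
    static Map<String, Number> compute(String text) {
        // Characters are counted as UTF-16 code units, excluding CR and LF.
        long chars = text.chars().filter(c -> c != '\r' && c != '\n').count();
        int words = countSegments(
                BreakIterator.getWordInstance(Locale.ENGLISH), text);
        int sentences = countSegments(
                BreakIterator.getSentenceInstance(Locale.ENGLISH), text);

        // Assumption: a paragraph is a non-blank chunk between line breaks.
        int paragraphs = 0;
        for (String chunk : text.split("\\R+")) {
            if (!chunk.trim().isEmpty()) {
                paragraphs++;
            }
        }

        Map<String, Number> stats = new LinkedHashMap<>();
        stats.put("document.stat.characterCount", chars);
        stats.put("document.stat.wordCount", words);
        stats.put("document.stat.sentenceCount", sentences);
        stats.put("document.stat.paragraphCount", paragraphs);
        stats.put("document.stat.averageSentenceWordCount",
                sentences == 0 ? 0d : (double) words / sentences);
        stats.put("document.stat.averageParagraphWordCount",
                paragraphs == 0 ? 0d : (double) words / paragraphs);
        return stats;
    }

    public static void main(String[] args) {
        compute("Hello world. How are you?\nFine.")
                .forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

    The averages are guarded against division by zero so that empty content yields 0 rather than NaN; whether the tagger does the same is not stated here.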

    You can specify a field matcher to obtain statistics about matching fields instead. When you do so, the field name will be inserted in the above names, right after "document.stat.". E.g.: document.stat.myfield.characterCount
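
    The naming rule just described can be sketched as follows. The class StatFieldNames and its statField method are hypothetical names used only for illustration; the only facts taken from this page are the "document.stat." prefix and the position of the field name.

```java
// Hypothetical helper for illustration; not part of the Norconex Importer API.
public class StatFieldNames {

    static final String PREFIX = "document.stat.";

    // Builds the metadata field name for a statistic. When sourceField is
    // null or blank, the statistic applies to the document content.
    static String statField(String sourceField, String statName) {
        if (sourceField == null || sourceField.trim().isEmpty()) {
            return PREFIX + statName;
        }
        return PREFIX + sourceField.trim() + "." + statName;
    }

    public static void main(String[] args) {
        // Statistic computed on the document content.
        System.out.println(statField(null, "characterCount"));
        // Statistic computed on a field named "myfield".
        System.out.println(statField("myfield", "characterCount"));
    }
}
```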

    Can be used either as a pre-parse (text-only) handler or as a post-parse handler.

    XML configuration usage:

    
    <handler
        class="com.norconex.importer.handler.tagger.impl.TextStatisticsTagger"
        sourceCharset="(character encoding)">
      <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
      <restrictTo>
        <fieldMatcher>(field-matching expression)</fieldMatcher>
        <valueMatcher>(value-matching expression)</valueMatcher>
      </restrictTo>
      <fieldMatcher>
        (optional expression matching source fields to analyze instead of content)
      </fieldMatcher>
    </handler>

    XML usage example:

    
    <handler
        class="TextStatisticsTagger">
      <fieldMatcher>statistics</fieldMatcher>
    </handler>

    The above example creates statistics from the value of a field named "statistics".

    Since:
    2.0.0
    Author:
    Pascal Essiembre