TextStatisticsTagger (Norconex Importer 3.0.1 API)

java.lang.Object
- com.norconex.importer.handler.AbstractImporterHandler
- - com.norconex.importer.handler.tagger.AbstractDocumentTagger
  - - com.norconex.importer.handler.tagger.AbstractCharStreamTagger
    - - com.norconex.importer.handler.tagger.impl.TextStatisticsTagger

All Implemented Interfaces: IXMLConfigurable, IImporterHandler, IDocumentTagger

public class TextStatisticsTagger
extends AbstractCharStreamTagger
implements IXMLConfigurable

Analyzes the content of the supplied document and adds statistical information about that content, or about matching fields, as metadata fields. By default, statistics are computed on the document content. The table below lists the metadata fields created, along with their descriptions.

Statistic fields
Field name	Description
document.stat.characterCount	Total number of characters (excluding carriage returns and line feeds).
document.stat.wordCount	Total number of words.
document.stat.sentenceCount	Total number of sentences.
document.stat.paragraphCount	Total number of paragraphs.
document.stat.averageWordCharacterCount	Average number of characters per word.
document.stat.averageSentenceCharacterCount	Average number of characters per sentence (including non-word characters, such as spaces or slashes).
document.stat.averageSentenceWordCount	Average number of words per sentence.
document.stat.averageParagraphCharacterCount	Average number of characters per paragraph (including non-word characters, such as spaces or slashes).
document.stat.averageParagraphSentenceCount	Average number of sentences per paragraph.
document.stat.averageParagraphWordCount	Average number of words per paragraph.
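
To make the relationship between the counts and the averages in the table concrete, here is a minimal JDK-only sketch that derives a few of these values from a string. It is an illustration under stated assumptions, not the tagger's actual implementation: the class name `TextStatsSketch`, the definition of a word as a run of letters or digits, and the definition of a paragraph as text delimited by line breaks are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Norconex Importer.
class TextStatsSketch {

    // Assumption: a word is a maximal run of Unicode letters or digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");
    // Assumption: paragraphs are separated by one or more line breaks.
    private static final Pattern PARAGRAPH_BREAK = Pattern.compile("[\\r\\n]+");

    static Map<String, String> compute(String text) {
        // Count UTF-16 code units, skipping carriage returns and line feeds.
        long characterCount = text.chars()
                .filter(c -> c != '\r' && c != '\n')
                .count();

        // Count words and the characters they contain.
        long wordCount = 0;
        long wordCharacters = 0;
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            wordCount++;
            wordCharacters += m.end() - m.start();
        }

        // Count non-blank paragraphs.
        long paragraphCount = 0;
        for (String paragraph : PARAGRAPH_BREAK.split(text)) {
            if (!paragraph.trim().isEmpty()) {
                paragraphCount++;
            }
        }

        Map<String, String> stats = new LinkedHashMap<>();
        stats.put("document.stat.characterCount", Long.toString(characterCount));
        stats.put("document.stat.wordCount", Long.toString(wordCount));
        stats.put("document.stat.paragraphCount", Long.toString(paragraphCount));
        stats.put("document.stat.averageWordCharacterCount",
                Double.toString(ratio(wordCharacters, wordCount)));
        stats.put("document.stat.averageParagraphWordCount",
                Double.toString(ratio(wordCount, paragraphCount)));
        return stats;
    }

    // Returns 0 instead of dividing by zero when there is nothing to average.
    private static double ratio(long total, long count) {
        return count == 0 ? 0.0 : (double) total / count;
    }

    public static void main(String[] args) {
        compute("One two three.\nFour five.")
                .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Each average is simply a total divided by the matching count, so the averages follow directly from the count fields. The tagger's own tokenization rules may differ from the assumptions used here, so its values for the same text may not match this sketch.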

You can specify a field matcher to obtain statistics about matching fields instead. When you do so, the field name will be inserted in the above names, right after "document.stat.". E.g.: document.stat.myfield.characterCount
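
The naming rule above can be sketched as a small helper that builds the metadata key for a given statistic. The class `StatFieldNames` and the method `metadataKey` are hypothetical names introduced for this illustration; they are not part of the Norconex API, and the tagger builds its keys internally.

```java
// Hypothetical illustration only; not part of Norconex Importer.
class StatFieldNames {

    private static final String PREFIX = "document.stat.";

    // A null or blank sourceField means the statistic describes the document
    // content; otherwise the field name is inserted right after the prefix.
    static String metadataKey(String sourceField, String statistic) {
        if (sourceField == null || sourceField.trim().isEmpty()) {
            return PREFIX + statistic;
        }
        return PREFIX + sourceField.trim() + "." + statistic;
    }

    public static void main(String[] args) {
        System.out.println(metadataKey(null, "characterCount"));
        System.out.println(metadataKey("myfield", "characterCount"));
    }
}
```

A caller reading the statistics back from the metadata could use such a helper to look up the right key, whether the tagger was configured for the content or for a matched field.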

Can be used as either a pre-parse (text-only) or a post-parse handler.

XML configuration usage:


<handler
    class="com.norconex.importer.handler.tagger.impl.TextStatisticsTagger"
    sourceCharset="(character encoding)">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  <fieldMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (optional expression matching source fields to analyze instead of content)
  </fieldMatcher>
</handler>

XML usage example:


<handler
    class="TextStatisticsTagger">
  <fieldMatcher>statistics</fieldMatcher>
</handler>

The above creates statistics from the value of a field named "statistics".

Since: 2.0.0
Author: Pascal Essiembre

Constructor Summary

Constructors
Constructor and Description

TextStatisticsTagger()

Method Summary

Modifier and Type	Method and Description
`protected void`	`analyze(Reader input, Properties metadata, String field)`
`boolean`	`equals(Object other)`
`TextMatcher`	`getFieldMatcher()` Gets the field matcher for fields to analyze.
`String`	`getFieldName()` Deprecated. Since 3.0.0, use `getFieldMatcher()`.
`int`	`hashCode()`
`protected void`	`loadCharStreamTaggerFromXML(XML xml)` Loads configuration settings specific to the implementing class.
`protected void`	`saveCharStreamTaggerToXML(XML xml)` Saves configuration settings specific to the implementing class.
`void`	`setFieldMatcher(TextMatcher fieldMatcher)` Sets the field matcher for fields to analyze.
`void`	`setFieldName(String fieldName)` Deprecated. Since 3.0.0, use `setFieldMatcher(TextMatcher)`.
`protected void`	`tagTextDocument(HandlerDoc doc, Reader input, ParseState parseState)`
`String`	`toString()`

Methods inherited from class com.norconex.importer.handler.tagger.AbstractCharStreamTagger
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument

Methods inherited from class com.norconex.importer.handler.tagger.AbstractDocumentTagger
tagDocument

Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Methods inherited from interface com.norconex.commons.lang.xml.IXMLConfigurable
loadFromXML, saveToXML

- Constructor Detail
  - TextStatisticsTagger
```
public TextStatisticsTagger()
```
- Method Detail
  - tagTextDocument
```
protected void tagTextDocument(HandlerDoc doc,
                               Reader input,
                               ParseState parseState)
                        throws ImporterHandlerException
```
    Specified by:
    
    tagTextDocument in class AbstractCharStreamTagger
    
    Throws:
    
    ImporterHandlerException
  - analyze
```
protected void analyze(Reader input,
                       Properties metadata,
                       String field)
```
  - getFieldName
```
@Deprecated
public String getFieldName()
```
    Deprecated. Since 3.0.0, use getFieldMatcher().
    
    Gets the name of the field containing the text to analyze.
    
    Returns:
    
    field name
  - setFieldName
```
@Deprecated
public void setFieldName(String fieldName)
```
    Deprecated. Since 3.0.0, use setFieldMatcher(TextMatcher).
    
    Sets the name of the field containing the text to analyze.
    
    Parameters:
    
    fieldName - field name
  - getFieldMatcher
```
public TextMatcher getFieldMatcher()
```
    Gets the field matcher for fields to analyze.
    
    Returns:
    
    field matcher
    
    Since:
    
    3.0.0
  - setFieldMatcher
```
public void setFieldMatcher(TextMatcher fieldMatcher)
```
    Sets the field matcher for fields to analyze.
    
    Parameters:
    
    fieldMatcher - field matcher
    
    Since:
    
    3.0.0
  - loadCharStreamTaggerFromXML
```
protected void loadCharStreamTaggerFromXML(XML xml)
```
    Description copied from class: AbstractCharStreamTagger
    
    Loads configuration settings specific to the implementing class.
    
    Specified by:
    
    loadCharStreamTaggerFromXML in class AbstractCharStreamTagger
    
    Parameters:
    
    xml - XML configuration
  - saveCharStreamTaggerToXML
```
protected void saveCharStreamTaggerToXML(XML xml)
```
    Description copied from class: AbstractCharStreamTagger
    
    Saves configuration settings specific to the implementing class. The parent tag along with the "class" attribute are already written. Implementors must not close the writer.
    
    Specified by:
    
    saveCharStreamTaggerToXML in class AbstractCharStreamTagger
    
    Parameters:
    
    xml - the XML
  - equals
```
public boolean equals(Object other)
```
    Overrides:
    
    equals in class AbstractCharStreamTagger
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class AbstractCharStreamTagger
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class AbstractCharStreamTagger
