public class TextStatisticsTagger extends AbstractCharStreamTagger implements IXMLConfigurable
Analyzes the content of the supplied document and adds statistical information about that content, or about a specified field, as metadata fields. By default, the statistics describe the document content. The following metadata fields are created:
Field name | Description |
---|---|
document.stat.characterCount | Total number of characters (excluding carriage returns and line feeds). |
document.stat.wordCount | Total number of words. |
document.stat.sentenceCount | Total number of sentences. |
document.stat.paragraphCount | Total number of paragraphs. |
document.stat.averageWordCharacterCount | Average number of characters per word. |
document.stat.averageSentenceCharacterCount | Average number of characters per sentence (including non-word characters, such as spaces or slashes). |
document.stat.averageSentenceWordCount | Average number of words per sentence. |
document.stat.averageParagraphCharacterCount | Average number of characters per paragraph (including non-word characters, such as spaces or slashes). |
document.stat.averageParagraphSentenceCount | Average number of sentences per paragraph. |
document.stat.averageParagraphWordCount | Average number of words per paragraph. |
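
To make the simplest of these fields concrete, here is a minimal, self-contained Java sketch that computes three of them. It is not the tagger's implementation: the class `SimpleTextStats`, its `compute` method, and the rule that a word is any run of non-whitespace characters are all assumptions made for illustration. The tagger's actual word, sentence, and paragraph detection may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; the tagger's real tokenization rules may differ. */
final class SimpleTextStats {
    private SimpleTextStats() {}

    /** Returns a few statistics keyed by the field names from the table above. */
    static Map<String, String> compute(String text) {
        // Count every character except carriage returns and line feeds.
        long charCount = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != '\r' && c != '\n') {
                charCount++;
            }
        }

        // Assumption: a word is any run of non-whitespace characters.
        long wordCount = 0;
        long wordCharCount = 0;
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                wordCount++;
                wordCharCount += token.length();
            }
        }
        // Avoid dividing by zero for empty input.
        double avgWordChars =
                wordCount == 0 ? 0.0 : (double) wordCharCount / wordCount;

        Map<String, String> stats = new LinkedHashMap<>();
        stats.put("document.stat.characterCount", Long.toString(charCount));
        stats.put("document.stat.wordCount", Long.toString(wordCount));
        stats.put("document.stat.averageWordCharacterCount",
                Double.toString(avgWordChars));
        return stats;
    }
}
```

Calling `SimpleTextStats.compute("ab cd\r\nef")` under these assumptions counts 7 characters and 3 words, because the carriage return and line feed are excluded from the character count but still separate words.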
You can specify a field name to obtain statistics about that field instead.
When you do so, the field name will be inserted in the above
names, right after "document.stat.". E.g.:
document.stat.myfield.characterCount
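
The naming rule can be sketched in a few lines of Java. This is only an illustration of the rule as described here: the class `StatFieldNames` and its `of` method are hypothetical helpers, not part of the Norconex Importer API, and treating a null or blank field name as "use the content" is an assumption.

```java
import java.util.Objects;

/** Hypothetical helper illustrating the naming rule; not part of the Importer API. */
final class StatFieldNames {
    private static final String PREFIX = "document.stat.";

    private StatFieldNames() {}

    /**
     * Builds a statistics field name. Assumption: a null or blank source
     * field means the statistics describe the document content, so nothing
     * is inserted after the prefix.
     */
    static String of(String sourceField, String statName) {
        Objects.requireNonNull(statName, "statName");
        if (sourceField == null || sourceField.trim().isEmpty()) {
            return PREFIX + statName;
        }
        // The source field name goes right after "document.stat.".
        return PREFIX + sourceField + "." + statName;
    }
}
```

For example, `StatFieldNames.of("myfield", "characterCount")` yields the name shown above, while `StatFieldNames.of(null, "characterCount")` yields the default content-based name from the table.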
Can be used as either a pre-parse (text-only) or post-parse handler.
<tagger class="com.norconex.importer.handler.tagger.impl.TextStatisticsTagger"
    sourceCharset="(character encoding)"
    fieldName="(optional field name instead of using content)" >
  <restrictTo caseSensitive="[false|true]"
      field="(name of header/metadata field name to match)">
    (regular expression of value to match)
  </restrictTo>
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
</tagger>
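The following computes statistics for a field called "statistics" instead of the document content, so the resulting field names start with "document.stat.statistics.".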
<tagger class="com.norconex.importer.handler.tagger.impl.TextStatisticsTagger" fieldName="statistics" />
Constructor and Description |
---|
TextStatisticsTagger() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
String | getFieldName() |
int | hashCode() |
protected void | loadCharStreamTaggerFromXML(org.apache.commons.configuration.XMLConfiguration xml) Loads configuration settings specific to the implementing class. |
protected void | saveCharStreamTaggerToXML(EnhancedXMLStreamWriter writer) Saves configuration settings specific to the implementing class. |
void | setFieldName(String fieldName) |
protected void | tagTextDocument(String reference, Reader input, ImporterMetadata metadata, boolean parsed) |
String | toString() |
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
tagDocument
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
loadFromXML, saveToXML
protected void tagTextDocument(String reference, Reader input, ImporterMetadata metadata, boolean parsed) throws ImporterHandlerException
Specified by: tagTextDocument in class AbstractCharStreamTagger
Throws: ImporterHandlerException

public String getFieldName()

public void setFieldName(String fieldName)

protected void loadCharStreamTaggerFromXML(org.apache.commons.configuration.XMLConfiguration xml) throws IOException
Description copied from class: AbstractCharStreamTagger
Loads configuration settings specific to the implementing class.
Specified by: loadCharStreamTaggerFromXML in class AbstractCharStreamTagger
Parameters: xml - xml configuration
Throws: IOException - could not load from XML

protected void saveCharStreamTaggerToXML(EnhancedXMLStreamWriter writer) throws XMLStreamException
Description copied from class: AbstractCharStreamTagger
Saves configuration settings specific to the implementing class.
Specified by: saveCharStreamTaggerToXML in class AbstractCharStreamTagger
Parameters: writer - the xml writer
Throws: XMLStreamException - could not save to XML

public boolean equals(Object other)
Overrides: equals in class AbstractCharStreamTagger

public int hashCode()
Overrides: hashCode in class AbstractCharStreamTagger

public String toString()
Overrides: toString in class AbstractCharStreamTagger
Copyright © 2009–2021 Norconex Inc. All rights reserved.