public class TextStatisticsTagger extends AbstractCharStreamTagger implements IXMLConfigurable
Analyzes the content of the supplied document and adds statistical information about that content, or about selected fields, as metadata fields. By default, the statistics describe the document content. The following table lists the metadata fields created, along with their descriptions.
Field name | Description |
---|---|
document.stat.characterCount | Total number of characters (excluding carriage returns and line feeds). |
document.stat.wordCount | Total number of words. |
document.stat.sentenceCount | Total number of sentences. |
document.stat.paragraphCount | Total number of paragraphs. |
document.stat.averageWordCharacterCount | Average number of characters per word. |
document.stat.averageSentenceCharacterCount | Average number of characters per sentence (including non-word characters, such as spaces or slashes). |
document.stat.averageSentenceWordCount | Average number of words per sentence. |
document.stat.averageParagraphCharacterCount | Average number of characters per paragraph (including non-word characters, such as spaces or slashes). |
document.stat.averageParagraphSentenceCount | Average number of sentences per paragraph. |
document.stat.averageParagraphWordCount | Average number of words per paragraph. |
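
To make the counts and averages in the table concrete, here is a minimal Java sketch of how such statistics could be derived from plain text. It is not the tagger's actual algorithm: the class name `TextStatsSketch` is hypothetical, and the rules for what counts as a word, sentence, or paragraph are simplifying assumptions noted in the comments.

```java
import java.util.Arrays;

/**
 * Illustrative only: a hypothetical helper, not part of the Norconex
 * Importer API and not the tagger's actual algorithm.
 */
final class TextStatsSketch {

    final long characterCount;
    final long wordCount;
    final long sentenceCount;
    final long paragraphCount;

    TextStatsSketch(String text) {
        // Count UTF-16 code units, skipping carriage returns and line feeds.
        characterCount = text.chars()
                .filter(c -> c != '\r' && c != '\n')
                .count();
        // Assumption: a word is any run of non-whitespace characters.
        wordCount = countNonBlank(text.split("\\s+"));
        // Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
        sentenceCount = countNonBlank(text.split("(?<=[.!?])\\s+"));
        // Assumption: paragraphs are separated by at least one blank line.
        paragraphCount = countNonBlank(text.split("\\R\\s*\\R"));
    }

    /** Counts the parts that contain something other than whitespace. */
    private static long countNonBlank(String[] parts) {
        return Arrays.stream(parts).filter(p -> !p.trim().isEmpty()).count();
    }

    /** Divides safely, returning 0 when there is nothing to average over. */
    private static double average(long total, long count) {
        return count == 0 ? 0.0 : (double) total / count;
    }

    double averageSentenceWordCount() {
        return average(wordCount, sentenceCount);
    }

    double averageParagraphSentenceCount() {
        return average(sentenceCount, paragraphCount);
    }

    double averageParagraphWordCount() {
        return average(wordCount, paragraphCount);
    }

    public static void main(String[] args) {
        TextStatsSketch s =
                new TextStatsSketch("Hello world. How are you?\n\nFine, thanks.");
        System.out.println("document.stat.characterCount = " + s.characterCount);
        System.out.println("document.stat.wordCount = " + s.wordCount);
        System.out.println("document.stat.sentenceCount = " + s.sentenceCount);
        System.out.println("document.stat.paragraphCount = " + s.paragraphCount);
    }
}
```

The real tagger may tokenize differently (for example, in how it treats punctuation or abbreviations), so values it produces can differ from this sketch.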
You can specify a field matcher to obtain statistics about matching fields instead.
When you do so, the field name is inserted in the names above, right after "document.stat.". For example:
document.stat.myfield.characterCount
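
The naming rule can be sketched in a few lines of Java. The class `StatFieldNames` and its `keyFor` method are hypothetical names used only for illustration; they are not part of the Norconex Importer API.

```java
/**
 * Illustrative only: builds metadata key names following the naming rule
 * described above. Hypothetical helper, not part of the Norconex Importer API.
 */
final class StatFieldNames {

    private static final String PREFIX = "document.stat.";

    /**
     * Returns the metadata key for a statistic. When a source field is
     * given, its name is inserted right after the "document.stat." prefix.
     */
    static String keyFor(String sourceField, String statName) {
        if (sourceField == null || sourceField.trim().isEmpty()) {
            // No source field: the statistic describes the document content.
            return PREFIX + statName;
        }
        return PREFIX + sourceField.trim() + "." + statName;
    }

    public static void main(String[] args) {
        System.out.println(keyFor(null, "wordCount"));
        System.out.println(keyFor("myfield", "characterCount"));
    }
}
```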
Can be used as either a pre-parse (text-only) or post-parse handler.
<handler
    class="com.norconex.importer.handler.tagger.impl.TextStatisticsTagger"
    sourceCharset="(character encoding)">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  <fieldMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (optional expression matching source fields to analyze instead of content)
  </fieldMatcher>
</handler>
<handler class="TextStatisticsTagger">
  <fieldMatcher>statistics</fieldMatcher>
</handler>
The above creates statistics from the value of a field called "statistics".
Constructor and Description |
---|
TextStatisticsTagger() |
Modifier and Type | Method and Description |
---|---|
protected void | analyze(Reader input, Properties metadata, String field) |
boolean | equals(Object other) |
TextMatcher | getFieldMatcher(): Gets the field matcher for fields to analyze. |
String | getFieldName(): Deprecated. Since 3.0.0, use getFieldMatcher(). |
int | hashCode() |
protected void | loadCharStreamTaggerFromXML(XML xml): Loads configuration settings specific to the implementing class. |
protected void | saveCharStreamTaggerToXML(XML xml): Saves configuration settings specific to the implementing class. |
void | setFieldMatcher(TextMatcher fieldMatcher): Sets the field matcher for fields to analyze. |
void | setFieldName(String fieldName): Deprecated. Since 3.0.0, use setFieldMatcher(TextMatcher). |
protected void | tagTextDocument(HandlerDoc doc, Reader input, ParseState parseState) |
String | toString() |
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
tagDocument
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
loadFromXML, saveToXML
protected void tagTextDocument(HandlerDoc doc, Reader input, ParseState parseState) throws ImporterHandlerException
Specified by: tagTextDocument in class AbstractCharStreamTagger
Throws: ImporterHandlerException

protected void analyze(Reader input, Properties metadata, String field)

@Deprecated public String getFieldName()
Deprecated. Since 3.0.0, use getFieldMatcher().

@Deprecated public void setFieldName(String fieldName)
Deprecated. Since 3.0.0, use setFieldMatcher(TextMatcher).
Parameters: fieldName - field name

public TextMatcher getFieldMatcher()

public void setFieldMatcher(TextMatcher fieldMatcher)
Parameters: fieldMatcher - field matcher

protected void loadCharStreamTaggerFromXML(XML xml)
Description copied from class AbstractCharStreamTagger: Loads configuration settings specific to the implementing class.
Specified by: loadCharStreamTaggerFromXML in class AbstractCharStreamTagger
Parameters: xml - XML configuration

protected void saveCharStreamTaggerToXML(XML xml)
Description copied from class AbstractCharStreamTagger: Saves configuration settings specific to the implementing class.
Specified by: saveCharStreamTaggerToXML in class AbstractCharStreamTagger
Parameters: xml - the XML

public boolean equals(Object other)
Overrides: equals in class AbstractCharStreamTagger

public int hashCode()
Overrides: hashCode in class AbstractCharStreamTagger

public String toString()
Overrides: toString in class AbstractCharStreamTagger
Copyright © 2009–2023 Norconex Inc. All rights reserved.