Class TextStatisticsTagger
- java.lang.Object
-
- com.norconex.importer.handler.AbstractImporterHandler
-
- com.norconex.importer.handler.tagger.AbstractDocumentTagger
-
- com.norconex.importer.handler.tagger.AbstractCharStreamTagger
-
- com.norconex.importer.handler.tagger.impl.TextStatisticsTagger
-
- All Implemented Interfaces:
IXMLConfigurable, IImporterHandler, IDocumentTagger
public class TextStatisticsTagger extends AbstractCharStreamTagger implements IXMLConfigurable
Analyzes the content of the supplied document and adds statistical information about its content, or about selected fields, as metadata fields. By default, the statistics are computed on the document content. The new metadata fields and their descriptions are listed below.
Statistic fields (field name: description):
document.stat.characterCount: Total number of characters (excluding carriage returns and line feeds).
document.stat.wordCount: Total number of words.
document.stat.sentenceCount: Total number of sentences.
document.stat.paragraphCount: Total number of paragraphs.
document.stat.averageWordCharacterCount: Average number of characters per word.
document.stat.averageSentenceCharacterCount: Average number of characters per sentence (including non-word characters, such as spaces or slashes).
document.stat.averageSentenceWordCount: Average number of words per sentence.
document.stat.averageParagraphCharacterCount: Average number of characters per paragraph (including non-word characters, such as spaces or slashes).
document.stat.averageParagraphSentenceCount: Average number of sentences per paragraph.
document.stat.averageParagraphWordCount: Average number of words per paragraph.
You can specify a field matcher to obtain statistics about matching fields instead. When you do so, the field name is inserted in the above names, right after "document.stat.". For example:
document.stat.myfield.characterCount
Can be used as either a pre-parse (text-only) or a post-parse handler.
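To make the statistic definitions concrete, here is a minimal, self-contained sketch of how such counts and averages could be derived from plain text. It is not the tagger's actual implementation: the class name TextStatsSketch is hypothetical, and the tokenization rules (whitespace-delimited words, sentences ending in ".", "!" or "?", paragraphs separated by line breaks) and the one-decimal number formatting are simplifying assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of the Norconex Importer API.
class TextStatsSketch {

    static Map<String, String> compute(String text) {
        // Characters, excluding carriage returns and line feeds.
        long chars = text.chars().filter(c -> c != '\r' && c != '\n').count();

        long paragraphs = 0;
        long sentences = 0;
        long words = 0;
        // Assumption: one or more line breaks separate paragraphs.
        for (String raw : text.split("\\R+")) {
            String paragraph = raw.trim();
            if (paragraph.isEmpty()) {
                continue;
            }
            paragraphs++;
            // Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
            sentences += paragraph.split("(?<=[.!?])\\s+").length;
            // Assumption: words are whitespace-delimited tokens.
            words += paragraph.split("\\s+").length;
        }

        Map<String, String> stats = new LinkedHashMap<>();
        stats.put("document.stat.characterCount", Long.toString(chars));
        stats.put("document.stat.wordCount", Long.toString(words));
        stats.put("document.stat.sentenceCount", Long.toString(sentences));
        stats.put("document.stat.paragraphCount", Long.toString(paragraphs));
        stats.put("document.stat.averageSentenceWordCount", average(words, sentences));
        stats.put("document.stat.averageParagraphSentenceCount", average(sentences, paragraphs));
        stats.put("document.stat.averageParagraphWordCount", average(words, paragraphs));
        return stats;
    }

    // Formats total/count with one decimal; returns "0.0" when count is zero.
    private static String average(long total, long count) {
        double value = count == 0 ? 0d : (double) total / count;
        return String.format(Locale.ROOT, "%.1f", value);
    }

    public static void main(String[] args) {
        String text = "Hello world. This is a test.\nSecond paragraph here.";
        compute(text).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

The real handler may tokenize differently (for instance, with locale-aware boundary detection), so values for edge cases such as abbreviations or unpunctuated text could differ from this sketch.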
XML configuration usage:
<handler class="com.norconex.importer.handler.tagger.impl.TextStatisticsTagger"
    sourceCharset="(character encoding)">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher>(field-matching expression)</fieldMatcher>
    <valueMatcher>(value-matching expression)</valueMatcher>
  </restrictTo>
  <fieldMatcher>
    (optional expression matching source fields to analyze instead of content)
  </fieldMatcher>
</handler>
XML usage example:
<handler class="TextStatisticsTagger">
  <fieldMatcher>statistics</fieldMatcher>
</handler>
The above creates statistics from the value of a field called "statistics".
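For this example, the naming rule described earlier (the matched field name is inserted right after "document.stat.") implies metadata field names such as document.stat.statistics.wordCount. The short sketch below only enumerates those names; the class and method names are hypothetical and it does not call the Norconex API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the field-naming rule; not part of the Norconex API.
class StatFieldNamesSketch {

    // The ten statistic suffixes listed in the class description.
    static final String[] STAT_NAMES = {
        "characterCount", "wordCount", "sentenceCount", "paragraphCount",
        "averageWordCharacterCount", "averageSentenceCharacterCount",
        "averageSentenceWordCount", "averageParagraphCharacterCount",
        "averageParagraphSentenceCount", "averageParagraphWordCount"
    };

    // Builds the metadata field names for a matched field,
    // or for the document content when matchedField is null or empty.
    static List<String> fieldNamesFor(String matchedField) {
        String prefix = "document.stat.";
        if (matchedField != null && !matchedField.isEmpty()) {
            prefix += matchedField + ".";
        }
        List<String> names = new ArrayList<>();
        for (String stat : STAT_NAMES) {
            names.add(prefix + stat);
        }
        return names;
    }

    public static void main(String[] args) {
        fieldNamesFor("statistics").forEach(System.out::println);
    }
}
```

If the field matcher matches several fields, each matched field would presumably get its own set of names following the same pattern.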
- Since:
- 2.0.0
- Author:
- Pascal Essiembre
-
-
Constructor Summary
Constructor: TextStatisticsTagger()
-
Method Summary
Modifier and Type, Method, Description:
protected void analyze(Reader input, Properties metadata, String field)
boolean equals(Object other)
TextMatcher getFieldMatcher()
- Gets the field matcher for fields to analyze.
String getFieldName()
- Deprecated. Since 3.0.0, use getFieldMatcher().
int hashCode()
protected void loadCharStreamTaggerFromXML(XML xml)
- Loads configuration settings specific to the implementing class.
protected void saveCharStreamTaggerToXML(XML xml)
- Saves configuration settings specific to the implementing class.
void setFieldMatcher(TextMatcher fieldMatcher)
- Sets the field matcher for fields to analyze.
void setFieldName(String fieldName)
- Deprecated. Since 3.0.0, use setFieldMatcher(TextMatcher).
protected void tagTextDocument(HandlerDoc doc, Reader input, ParseState parseState)
String toString()
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractCharStreamTagger
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractDocumentTagger
tagDocument
-
Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
-
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface com.norconex.commons.lang.xml.IXMLConfigurable
loadFromXML, saveToXML
-
-
-
-
Method Detail
-
tagTextDocument
protected void tagTextDocument(HandlerDoc doc, Reader input, ParseState parseState) throws ImporterHandlerException
- Specified by:
tagTextDocument in class AbstractCharStreamTagger
- Throws:
ImporterHandlerException
-
analyze
protected void analyze(Reader input, Properties metadata, String field)
-
getFieldName
@Deprecated public String getFieldName()
Deprecated. Since 3.0.0, use getFieldMatcher().
Gets the name of the field containing the text to analyze.
- Returns:
- field name
-
setFieldName
@Deprecated public void setFieldName(String fieldName)
Deprecated. Since 3.0.0, use setFieldMatcher(TextMatcher).
Sets the name of the field containing the text to analyze.
- Parameters:
fieldName - field name
-
getFieldMatcher
public TextMatcher getFieldMatcher()
Gets the field matcher for fields to analyze.
- Returns:
- field matcher
- Since:
- 3.0.0
-
setFieldMatcher
public void setFieldMatcher(TextMatcher fieldMatcher)
Sets the field matcher for fields to analyze.
- Parameters:
fieldMatcher - field matcher
- Since:
- 3.0.0
-
loadCharStreamTaggerFromXML
protected void loadCharStreamTaggerFromXML(XML xml)
Description copied from class: AbstractCharStreamTagger
Loads configuration settings specific to the implementing class.
- Specified by:
loadCharStreamTaggerFromXML in class AbstractCharStreamTagger
- Parameters:
xml - XML configuration
-
saveCharStreamTaggerToXML
protected void saveCharStreamTaggerToXML(XML xml)
Description copied from class: AbstractCharStreamTagger
Saves configuration settings specific to the implementing class. The parent tag along with the "class" attribute are already written. Implementors must not close the writer.
- Specified by:
saveCharStreamTaggerToXML in class AbstractCharStreamTagger
- Parameters:
xml - the XML
-
equals
public boolean equals(Object other)
- Overrides:
equals in class AbstractCharStreamTagger
-
hashCode
public int hashCode()
- Overrides:
hashCode in class AbstractCharStreamTagger
-
toString
public String toString()
- Overrides:
toString in class AbstractCharStreamTagger
-
-