Class AbstractStringTagger
- java.lang.Object
- com.norconex.importer.handler.AbstractImporterHandler
- com.norconex.importer.handler.tagger.AbstractDocumentTagger
- com.norconex.importer.handler.tagger.AbstractCharStreamTagger
- com.norconex.importer.handler.tagger.AbstractStringTagger
- All Implemented Interfaces:
IXMLConfigurable, IImporterHandler, IDocumentTagger
- Direct Known Subclasses:
LanguageTagger, RegexTagger, ScriptTagger, TextBetweenTagger, TextPatternTagger, TitleGeneratorTagger
public abstract class AbstractStringTagger extends AbstractCharStreamTagger
Base class to facilitate creating taggers based on text content, loading text into a StringBuilder for in-memory processing.
Since 2.2.0, this class limits the memory used for analysing content by reading one section of text at a time. Each section is sent for tagging as soon as it is read, so that no two sections exist in memory at once. Subclasses should respect this approach. Each section has a maximum number of characters equal to the maximum read size defined using setMaxReadSize(int). When none is set, the default read size is defined by TextReader.DEFAULT_MAX_READ_SIZE.
An attempt is made to break sections nicely after a paragraph, sentence, or word. When that is not possible, long text is cut at a size equal to the maximum read size.
Implementors should be mindful of memory use when working with the StringBuilder.
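The sectioned reading described above can be sketched as follows. This is a minimal, self-contained illustration and not the library's implementation: the class SectionReaderSketch and its methods readSections and findBreak are hypothetical names, and the break rules used here (a blank line, then ". ", then a space) are simplifying assumptions about what "paragraph, sentence, or word" might mean.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

/** Illustrative sketch only; this is not the library's TextReader. */
class SectionReaderSketch {

    /** Hands sections of at most maxReadSize characters to sink, one at a time. */
    static void readSections(Reader input, int maxReadSize, Consumer<String> sink) {
        if (maxReadSize <= 0) {
            throw new IllegalArgumentException("maxReadSize must be positive");
        }
        StringBuilder buffer = new StringBuilder();
        char[] chunk = new char[maxReadSize];
        boolean eof = false;
        try {
            while (!eof) {
                // Top up the buffer until it is full or the input ends.
                while (buffer.length() < maxReadSize) {
                    int read = input.read(chunk, 0, maxReadSize - buffer.length());
                    if (read == -1) {
                        eof = true;
                        break;
                    }
                    buffer.append(chunk, 0, read);
                }
                if (buffer.length() == 0) {
                    break;
                }
                // At end of input emit what is left; otherwise look for a nice break.
                int cut = eof ? buffer.length() : findBreak(buffer);
                sink.accept(buffer.substring(0, cut));
                buffer.delete(0, cut); // the remainder starts the next section
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the cut position: after a paragraph, sentence, or word, else the full length. */
    static int findBreak(StringBuilder text) {
        int paragraph = text.lastIndexOf("\n\n");
        if (paragraph > 0) return paragraph + 2;
        int sentence = text.lastIndexOf(". ");
        if (sentence > 0) return sentence + 2;
        int space = text.lastIndexOf(" ");
        if (space > 0) return space + 1;
        return text.length();
    }
}
```

With the input "One two. Three four five" and a maximum read size of 12, this sketch emits "One two. ", then "Three four ", then "five". Only the current buffer is held at any time, which mirrors the one-section-in-memory approach the class description asks subclasses to respect.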
XML configuration usage:
maxReadSize="(max characters to read at once)" sourceCharset="(character encoding)"
Subclasses inherit the above IXMLConfigurable attribute(s), in addition to <restrictTo>.
- Author:
Pascal Essiembre
Constructor Summary
Constructors
AbstractStringTagger()
Method Summary
boolean equals(Object other)
int getMaxReadSize()
Gets the maximum number of characters to read from content for tagging at once.
int hashCode()
protected void loadCharStreamTaggerFromXML(XML xml)
Loads configuration settings specific to the implementing class.
protected abstract void loadStringTaggerFromXML(XML xml)
Loads configuration settings specific to the implementing class.
protected void saveCharStreamTaggerToXML(XML xml)
Saves configuration settings specific to the implementing class.
protected abstract void saveStringTaggerToXML(XML xml)
Saves configuration settings specific to the implementing class.
void setMaxReadSize(int maxReadSize)
Sets the maximum number of characters to read from content for tagging at once.
protected abstract void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex)
protected void tagTextDocument(HandlerDoc doc, Reader input, ParseState parseState)
String toString()
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractCharStreamTagger
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractDocumentTagger
tagDocument
-
Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
Method Detail
-
tagTextDocument
protected final void tagTextDocument(HandlerDoc doc, Reader input, ParseState parseState) throws ImporterHandlerException
- Specified by:
tagTextDocument in class AbstractCharStreamTagger
- Throws:
ImporterHandlerException
-
getMaxReadSize
public int getMaxReadSize()
Gets the maximum number of characters to read from content for tagging at once. Default is TextReader.DEFAULT_MAX_READ_SIZE.
- Returns:
maximum read size
-
setMaxReadSize
public void setMaxReadSize(int maxReadSize)
Sets the maximum number of characters to read from content for tagging at once.
- Parameters:
maxReadSize - maximum read size
-
tagStringContent
protected abstract void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) throws ImporterHandlerException
- Throws:
ImporterHandlerException
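The relationship between the final tagTextDocument and the abstract tagStringContent can be pictured with the sketch below. It is a stand-in written for illustration, not the Norconex API: StringTaggerSketch, tagText, SectionLengthTagger, and the "section.length" field are hypothetical, a plain Map replaces HandlerDoc, ParseState is omitted, the section index is assumed to be zero-based, and sections are cut at a fixed size rather than at paragraph, sentence, or word boundaries. It also takes a String for brevity, whereas the real class reads from a Reader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Stand-in for the base class: NOT the Norconex API, only the same overall shape. */
abstract class StringTaggerSketch {
    private int maxReadSize = 1000; // hypothetical default for this sketch

    void setMaxReadSize(int maxReadSize) {
        this.maxReadSize = maxReadSize;
    }

    /** Plays the role of the final tagTextDocument: one section handed off at a time. */
    final void tagText(Map<String, List<String>> metadata, String text) {
        int sectionIndex = 0; // zero-based here by assumption
        for (int start = 0; start < text.length(); start += maxReadSize) {
            int end = Math.min(text.length(), start + maxReadSize);
            StringBuilder section = new StringBuilder(text.substring(start, end));
            tagStringContent(metadata, section, sectionIndex++);
        }
    }

    /** Plays the role of tagStringContent, with a plain Map instead of HandlerDoc. */
    protected abstract void tagStringContent(
            Map<String, List<String>> metadata, StringBuilder content, int sectionIndex);
}

/** Hypothetical subclass: records each section's index and length as metadata. */
class SectionLengthTagger extends StringTaggerSketch {
    @Override
    protected void tagStringContent(
            Map<String, List<String>> metadata, StringBuilder content, int sectionIndex) {
        metadata.computeIfAbsent("section.length", k -> new ArrayList<>())
                .add(sectionIndex + ":" + content.length());
    }
}
```

With a maximum read size of 4 and the text "abcdefghij", the subclass in this sketch is called three times and records "0:4", "1:4", and "2:2". A subclass that needs document-wide results would have to accumulate them across calls, since it only ever sees one section.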
-
saveCharStreamTaggerToXML
protected final void saveCharStreamTaggerToXML(XML xml)
Description copied from class: AbstractCharStreamTagger
Saves configuration settings specific to the implementing class. The parent tag, along with the "class" attribute, is already written. Implementors must not close the writer.
- Specified by:
saveCharStreamTaggerToXML in class AbstractCharStreamTagger
- Parameters:
xml - the XML
-
saveStringTaggerToXML
protected abstract void saveStringTaggerToXML(XML xml)
Saves configuration settings specific to the implementing class. The parent tag, along with the "class" attribute, is already written. Implementors must not close the writer.
- Parameters:
xml - the XML
-
loadCharStreamTaggerFromXML
protected final void loadCharStreamTaggerFromXML(XML xml)
Description copied from class: AbstractCharStreamTagger
Loads configuration settings specific to the implementing class.
- Specified by:
loadCharStreamTaggerFromXML in class AbstractCharStreamTagger
- Parameters:
xml - XML configuration
-
loadStringTaggerFromXML
protected abstract void loadStringTaggerFromXML(XML xml)
Loads configuration settings specific to the implementing class.
- Parameters:
xml - XML configuration
-
equals
public boolean equals(Object other)
- Overrides:
equals in class AbstractCharStreamTagger
-
hashCode
public int hashCode()
- Overrides:
hashCode in class AbstractCharStreamTagger
-
toString
public String toString()
- Overrides:
toString in class AbstractCharStreamTagger