public abstract class AbstractStringTagger extends AbstractCharStreamTagger
Base class to facilitate creating taggers based on text content, loading
text into a StringBuilder for in-memory processing.
Since 2.2.0, this class limits the memory used for analysing
content by reading one section of text at a time. Each
section is sent for tagging as soon as it is read,
so that no two sections exist in memory at once. Sub-classes should
respect this approach. Each section has a maximum number of characters
equal to the maximum read size defined using setMaxReadSize(int).
When none is set, the default read size is defined by
TextReader.DEFAULT_MAX_READ_SIZE.
An attempt is made to break sections cleanly after a paragraph, sentence, or word. When that is not possible, long text is cut at a size equal to the maximum read size.
Implementors should be mindful of memory use when dealing with the string builder.
maxReadSize="(max characters to read at once)"
sourceCharset="(character encoding)"
Subclasses inherit the above IXMLConfigurable attribute(s), in addition to <restrictTo>.
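The section-by-section reading described above can be illustrated with a minimal, self-contained sketch. It is not the library's TextReader implementation: the class SectionReaderSketch, the SectionHandler callback, and the findBreak heuristic are hypothetical names introduced only for illustration, and the break rules (blank line, then ". ", then a space, then a hard cut) are a simplifying assumption.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class SectionReaderSketch {

    /** Hypothetical callback, loosely modelled on the content and sectionIndex arguments of tagStringContent. */
    public interface SectionHandler {
        void handle(StringBuilder content, int sectionIndex);
    }

    /** Reads the input in sections of at most maxReadSize characters (maxReadSize must be positive). */
    public static void readSections(Reader input, int maxReadSize, SectionHandler handler)
            throws IOException {
        StringBuilder buffer = new StringBuilder();
        char[] chunk = new char[maxReadSize];
        int sectionIndex = 0;
        boolean eof = false;
        while (true) {
            // Top up the buffer until it holds maxReadSize characters or the input ends.
            while (!eof && buffer.length() < maxReadSize) {
                int n = input.read(chunk, 0, maxReadSize - buffer.length());
                if (n < 0) {
                    eof = true;
                } else {
                    buffer.append(chunk, 0, n);
                }
            }
            if (buffer.length() == 0) {
                break;
            }
            // At end of input, emit whatever is left; otherwise look for a clean break.
            int cut = eof ? buffer.length() : findBreak(buffer);
            StringBuilder section = new StringBuilder(buffer.substring(0, cut));
            buffer.delete(0, cut); // the leftover characters start the next section
            handler.handle(section, sectionIndex++);
        }
    }

    /** Assumed heuristic: prefer a paragraph break, then a sentence end, then a space, else a hard cut. */
    public static int findBreak(CharSequence text) {
        String s = text.toString();
        int idx = s.lastIndexOf("\n\n");
        if (idx >= 0) {
            return idx + 2;
        }
        idx = s.lastIndexOf(". ");
        if (idx >= 0) {
            return idx + 2;
        }
        idx = s.lastIndexOf(' ');
        if (idx >= 0) {
            return idx + 1;
        }
        return s.length();
    }

    public static void main(String[] args) throws IOException {
        readSections(new StringReader("One two. Three four five"), 10,
                (content, index) -> System.out.println(index + ": [" + content + "]"));
    }
}
```

Each section is handed to the callback before more input is read, so only one section plus a small carry-over is held at a time. A real subclass would receive its sections through tagStringContent rather than through a callback like this one.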
Constructor and Description |
---|
AbstractStringTagger() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
int | getMaxReadSize() Gets the maximum number of characters to read from content for tagging at once. |
int | hashCode() |
protected void | loadCharStreamTaggerFromXML(XML xml) Loads configuration settings specific to the implementing class. |
protected abstract void | loadStringTaggerFromXML(XML xml) Loads configuration settings specific to the implementing class. |
protected void | saveCharStreamTaggerToXML(XML xml) Saves configuration settings specific to the implementing class. |
protected abstract void | saveStringTaggerToXML(XML xml) Saves configuration settings specific to the implementing class. |
void | setMaxReadSize(int maxReadSize) Sets the maximum number of characters to read from content for tagging at once. |
protected abstract void | tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) |
protected void | tagTextDocument(HandlerDoc doc, Reader input, ParseState parseState) |
String | toString() |
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
tagDocument
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
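protected final void tagTextDocument(HandlerDoc doc, Reader input, ParseState parseState) throws ImporterHandlerException
Specified by: tagTextDocument in class AbstractCharStreamTagger
Throws: ImporterHandlerException

public int getMaxReadSize()
Gets the maximum number of characters to read from content for tagging at once. When none is set, the default is TextReader.DEFAULT_MAX_READ_SIZE.

public void setMaxReadSize(int maxReadSize)
Sets the maximum number of characters to read from content for tagging at once.
Parameters: maxReadSize - maximum read size

protected abstract void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) throws ImporterHandlerException
Throws: ImporterHandlerException

protected final void saveCharStreamTaggerToXML(XML xml)
Description copied from class: AbstractCharStreamTagger
Saves configuration settings specific to the implementing class.
Specified by: saveCharStreamTaggerToXML in class AbstractCharStreamTagger
Parameters: xml - the XML

protected abstract void saveStringTaggerToXML(XML xml)
Saves configuration settings specific to the implementing class.
Parameters: xml - the XML

protected final void loadCharStreamTaggerFromXML(XML xml)
Description copied from class: AbstractCharStreamTagger
Loads configuration settings specific to the implementing class.
Specified by: loadCharStreamTaggerFromXML in class AbstractCharStreamTagger
Parameters: xml - xml configuration

protected abstract void loadStringTaggerFromXML(XML xml)
Loads configuration settings specific to the implementing class.
Parameters: xml - xml configuration

public boolean equals(Object other)
Overrides: equals in class AbstractCharStreamTagger

public int hashCode()
Overrides: hashCode in class AbstractCharStreamTagger

public String toString()
Overrides: toString in class AbstractCharStreamTagger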
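The pairing of final saveCharStreamTaggerToXML/loadCharStreamTaggerFromXML methods with abstract saveStringTaggerToXML/loadStringTaggerFromXML hooks suggests a template-method arrangement. The sketch below illustrates that pattern only: it uses a plain Map as a stand-in for the XML type, and the class names (XmlHookSketch, StringBase, PrefixTagger), the placeholder default of 1000, and the assumption that the final methods handle maxReadSize before delegating are all hypothetical, not taken from the library source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class XmlHookSketch {

    /** Hypothetical base class: owns maxReadSize and delegates the rest to subclass hooks. */
    public abstract static class StringBase {
        // Placeholder default for this sketch; not the value of TextReader.DEFAULT_MAX_READ_SIZE.
        private int maxReadSize = 1000;

        public int getMaxReadSize() {
            return maxReadSize;
        }

        public void setMaxReadSize(int maxReadSize) {
            this.maxReadSize = maxReadSize;
        }

        /** Final: writes the shared setting, then lets the subclass add its own. */
        public final void save(Map<String, String> cfg) {
            cfg.put("maxReadSize", Integer.toString(maxReadSize));
            saveStringLevel(cfg);
        }

        /** Final: reads the shared setting, then lets the subclass read its own. */
        public final void load(Map<String, String> cfg) {
            String value = cfg.get("maxReadSize");
            if (value != null) {
                maxReadSize = Integer.parseInt(value);
            }
            loadStringLevel(cfg);
        }

        protected abstract void saveStringLevel(Map<String, String> cfg);

        protected abstract void loadStringLevel(Map<String, String> cfg);
    }

    /** Hypothetical concrete class with a single setting of its own. */
    public static class PrefixTagger extends StringBase {
        private String prefix = "";

        public String getPrefix() {
            return prefix;
        }

        public void setPrefix(String prefix) {
            this.prefix = prefix;
        }

        @Override
        protected void saveStringLevel(Map<String, String> cfg) {
            cfg.put("prefix", prefix);
        }

        @Override
        protected void loadStringLevel(Map<String, String> cfg) {
            prefix = cfg.getOrDefault("prefix", "");
        }
    }

    public static void main(String[] args) {
        PrefixTagger original = new PrefixTagger();
        original.setMaxReadSize(42);
        original.setPrefix("x-");

        Map<String, String> cfg = new LinkedHashMap<>();
        original.save(cfg);

        PrefixTagger copy = new PrefixTagger();
        copy.load(cfg);
        System.out.println(copy.getMaxReadSize() + " " + copy.getPrefix());
    }
}
```

Keeping the outer methods final means a subclass cannot accidentally skip the shared setting; it only fills in the hook for its own configuration.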
Copyright © 2009–2023 Norconex Inc. All rights reserved.