Class AbstractStringTagger
- java.lang.Object
  - com.norconex.importer.handler.AbstractImporterHandler
    - com.norconex.importer.handler.tagger.AbstractDocumentTagger
      - com.norconex.importer.handler.tagger.AbstractCharStreamTagger
        - com.norconex.importer.handler.tagger.AbstractStringTagger
- All Implemented Interfaces:
IXMLConfigurable, IImporterHandler, IDocumentTagger
- Direct Known Subclasses:
LanguageTagger, RegexTagger, ScriptTagger, TextBetweenTagger, TextPatternTagger, TitleGeneratorTagger
public abstract class AbstractStringTagger extends AbstractCharStreamTagger
Base class to facilitate creating taggers based on text content, loading text into a StringBuilder for in-memory processing.
Since 2.2.0, this class limits the memory used for analysing content by reading one section of text at a time. Each section is sent for tagging as soon as it is read, so that no two sections exist in memory at once. Subclasses should respect this approach. Each section holds at most the number of characters set as the maximum read size with setMaxReadSize(int). When none is set, the default read size is defined by TextReader.DEFAULT_MAX_READ_SIZE.
An attempt is made to break sections cleanly after a paragraph, sentence, or word. When that is not possible, long text is cut at a size equal to the maximum read size.
Implementors should be mindful of memory use when working with the string builder.
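The sectioning behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the Norconex TextReader implementation: the class name SectionSplitSketch, its methods, and the choice of "\n\n", ". " and " " as paragraph, sentence, and word markers are all assumptions made for illustration. It shows only the general idea of preferring a natural break and falling back to a hard cut at the maximum read size.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical stand-in, not the Norconex TextReader.
final class SectionSplitSketch {

    /** Splits text into sections of at most maxReadSize characters. */
    static List<String> split(String text, int maxReadSize) {
        if (maxReadSize < 1) {
            throw new IllegalArgumentException("maxReadSize must be positive");
        }
        List<String> sections = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxReadSize, text.length());
            if (end < text.length()) {
                // More text follows, so look for a nicer place to cut.
                end = findBreak(text, start, end);
            }
            sections.add(text.substring(start, end));
            start = end;
        }
        return sections;
    }

    /** Prefers a paragraph break, then a sentence end, then a word boundary. */
    private static int findBreak(String text, int start, int hardEnd) {
        int cut = lastIndexAfter(text, "\n\n", start, hardEnd);
        if (cut < 0) {
            cut = lastIndexAfter(text, ". ", start, hardEnd);
        }
        if (cut < 0) {
            cut = lastIndexAfter(text, " ", start, hardEnd);
        }
        // No natural break found: cut at the maximum read size.
        return cut < 0 ? hardEnd : cut;
    }

    /** Position just after the last token ending at or before hardEnd, or -1. */
    private static int lastIndexAfter(String text, String token, int start, int hardEnd) {
        int idx = text.lastIndexOf(token, hardEnd - token.length());
        return idx >= start ? idx + token.length() : -1;
    }
}
```

Under these assumptions, the sections always concatenate back to the original text, and a run with no whitespace is cut at exactly the maximum read size.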
XML configuration usage:
maxReadSize="(max characters to read at once)" sourceCharset="(character encoding)"
Subclasses inherit the above IXMLConfigurable attribute(s), in addition to <restrictTo>.
- Author:
Pascal Essiembre
Constructor Summary
| Constructor | Description |
|---|---|
| AbstractStringTagger() | |
-
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| boolean | equals(Object other) | |
| int | getMaxReadSize() | Gets the maximum number of characters to read from content for tagging at once. |
| int | hashCode() | |
| protected void | loadCharStreamTaggerFromXML(XML xml) | Loads configuration settings specific to the implementing class. |
| protected abstract void | loadStringTaggerFromXML(XML xml) | Loads configuration settings specific to the implementing class. |
| protected void | saveCharStreamTaggerToXML(XML xml) | Saves configuration settings specific to the implementing class. |
| protected abstract void | saveStringTaggerToXML(XML xml) | Saves configuration settings specific to the implementing class. |
| void | setMaxReadSize(int maxReadSize) | Sets the maximum number of characters to read from content for tagging at once. |
| protected abstract void | tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) | |
| protected void | tagTextDocument(HandlerDoc doc, Reader input, ParseState parseState) | |
| String | toString() | |
Methods inherited from class com.norconex.importer.handler.tagger.AbstractCharStreamTagger
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractDocumentTagger
tagDocument
-
Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
Method Detail
-
tagTextDocument
protected final void tagTextDocument(HandlerDoc doc, Reader input, ParseState parseState) throws ImporterHandlerException
- Specified by:
tagTextDocument in class AbstractCharStreamTagger
- Throws:
ImporterHandlerException
-
getMaxReadSize
public int getMaxReadSize()
Gets the maximum number of characters to read from content for tagging at once. Default is TextReader.DEFAULT_MAX_READ_SIZE.
- Returns:
maximum read size
-
setMaxReadSize
public void setMaxReadSize(int maxReadSize)
Sets the maximum number of characters to read from content for tagging at once.
- Parameters:
maxReadSize - maximum read size
-
tagStringContent
protected abstract void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) throws ImporterHandlerException
- Throws:
ImporterHandlerException
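This method carries no description of its own, so the following sketch illustrates the per-section contract implied by the class description: the method is invoked once per section, with sectionIndex identifying the section. Everything here is a hypothetical stand-in and not the Norconex API: Metadata replaces HandlerDoc, SectionTagger mirrors only the shape of tagStringContent (ParseState is omitted), and the tagAll driver that reuses a single buffer is an assumption about how one might keep only one section in memory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in types only: this does not use or reproduce the Norconex API.
final class SectionTaggingSketch {

    /** Hypothetical stand-in for document metadata (field name to values). */
    static final class Metadata {
        final Map<String, List<String>> fields = new LinkedHashMap<>();

        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    /** Hypothetical interface mirroring the shape of tagStringContent. */
    interface SectionTagger {
        void tagStringContent(Metadata metadata, StringBuilder content, int sectionIndex);
    }

    /** Feeds sections to the tagger one at a time, reusing a single buffer. */
    static void tagAll(List<String> sections, Metadata metadata, SectionTagger tagger) {
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < sections.size(); i++) {
            buffer.setLength(0);            // discard the previous section
            buffer.append(sections.get(i));
            tagger.tagStringContent(metadata, buffer, i);
        }
    }

    /** Example tagger: records each section's length and keeps no reference to the content. */
    static final SectionTagger LENGTH_TAGGER = (metadata, content, sectionIndex) ->
            metadata.add("section.length", sectionIndex + ":" + content.length());
}
```

The example tagger stores only a small derived value per section rather than the section text, which is in keeping with the advice to be careful about memory when handling the string builder.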
-
saveCharStreamTaggerToXML
protected final void saveCharStreamTaggerToXML(XML xml)
Description copied from class: AbstractCharStreamTagger
Saves configuration settings specific to the implementing class. The parent tag, along with the "class" attribute, is already written. Implementors must not close the writer.
- Specified by:
saveCharStreamTaggerToXML in class AbstractCharStreamTagger
- Parameters:
xml - the XML
-
saveStringTaggerToXML
protected abstract void saveStringTaggerToXML(XML xml)
Saves configuration settings specific to the implementing class. The parent tag, along with the "class" attribute, is already written. Implementors must not close the writer.
- Parameters:
xml - the XML
-
loadCharStreamTaggerFromXML
protected final void loadCharStreamTaggerFromXML(XML xml)
Description copied from class: AbstractCharStreamTagger
Loads configuration settings specific to the implementing class.
- Specified by:
loadCharStreamTaggerFromXML in class AbstractCharStreamTagger
- Parameters:
xml - xml configuration
-
loadStringTaggerFromXML
protected abstract void loadStringTaggerFromXML(XML xml)
Loads configuration settings specific to the implementing class.
- Parameters:
xml - xml configuration
-
equals
public boolean equals(Object other)
- Overrides:
equals in class AbstractCharStreamTagger
-
hashCode
public int hashCode()
- Overrides:
hashCode in class AbstractCharStreamTagger
-
toString
public String toString()
- Overrides:
toString in class AbstractCharStreamTagger