Class AbstractStringTagger

  • All Implemented Interfaces:
    IXMLConfigurable, IImporterHandler, IDocumentTagger
    Direct Known Subclasses:
    LanguageTagger, RegexTagger, ScriptTagger, TextBetweenTagger, TextPatternTagger, TitleGeneratorTagger

    public abstract class AbstractStringTagger
    extends AbstractCharStreamTagger

    Base class that facilitates creating taggers based on text content by loading text into a StringBuilder for in-memory processing.

    Since 2.2.0, this class limits the memory used for analysing content by reading one section of text at a time. Each section is sent for tagging as soon as it is read, so that no two sections exist in memory at once. Subclasses should respect this approach. Each section has a maximum number of characters equal to the maximum read size defined using setMaxReadSize(int). When none is set, the default read size is defined by TextReader.DEFAULT_MAX_READ_SIZE.

    An attempt is made to break sections nicely after a paragraph, sentence, or word. When not possible, long text will be cut at a size equal to the maximum read size.
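
The sectioning behaviour described above can be illustrated with a minimal, standalone sketch. It is not the library's TextReader implementation: the class SectionSplitter, the SectionHandler interface, and the specific boundary rules (a blank line for a paragraph, '.', '!' or '?' followed by whitespace for a sentence, any whitespace for a word) are assumptions made for illustration only.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not part of the Norconex Importer API.
final class SectionSplitter {

    /** Hypothetical callback receiving one section at a time. */
    interface SectionHandler {
        void handle(StringBuilder section, int sectionIndex);
    }

    /**
     * Reads text in sections of at most maxReadSize characters and hands
     * each section to the handler before reading the next one.
     * Returns the number of sections produced.
     */
    static int readSections(Reader reader, int maxReadSize, SectionHandler handler) {
        if (maxReadSize < 1) {
            throw new IllegalArgumentException("maxReadSize must be positive");
        }
        StringBuilder buffer = new StringBuilder(maxReadSize);
        char[] chunk = new char[Math.min(maxReadSize, 4096)];
        boolean eof = false;
        int index = 0;
        try {
            while (true) {
                // Top up the buffer, never holding more than maxReadSize characters.
                while (!eof && buffer.length() < maxReadSize) {
                    int wanted = Math.min(chunk.length, maxReadSize - buffer.length());
                    int n = reader.read(chunk, 0, wanted);
                    if (n < 0) {
                        eof = true;
                    } else {
                        buffer.append(chunk, 0, n);
                    }
                }
                if (buffer.length() == 0) {
                    break;
                }
                // At end of input, emit what is left; otherwise look for a nice break.
                int cut = eof ? buffer.length() : findCut(buffer);
                handler.handle(new StringBuilder(buffer.substring(0, cut)), index++);
                // Text after the cut is carried over into the next section.
                buffer.delete(0, cut);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return index;
    }

    /** Prefers a paragraph break, then a sentence end, then a word boundary. */
    static int findCut(CharSequence text) {
        int end = text.length();
        for (int i = end - 1; i > 0; i--) {          // paragraph: blank line
            if (text.charAt(i) == '\n' && text.charAt(i - 1) == '\n') {
                return i + 1;
            }
        }
        for (int i = end - 1; i > 0; i--) {          // sentence: punctuation + whitespace
            char prev = text.charAt(i - 1);
            if (Character.isWhitespace(text.charAt(i))
                    && (prev == '.' || prev == '!' || prev == '?')) {
                return i + 1;
            }
        }
        for (int i = end - 1; i > 0; i--) {          // word: any whitespace
            if (Character.isWhitespace(text.charAt(i))) {
                return i + 1;
            }
        }
        return end;                                   // no boundary: hard cut at full size
    }
}
```

Under these assumptions, calling readSections with a StringReader and a maximum read size of 10 splits a short sentence pair at the sentence end first, then at a word boundary, and falls back to a hard cut only when the section contains no whitespace at all.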

    Implementors should be mindful of memory usage when working with the StringBuilder.
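
One way to stay memory-conscious, sketched here under assumptions, is to scan the section in place instead of copying it with toString(). The class InPlaceScan and the method countWords are hypothetical names and not part of this API; the sketch relies only on the standard fact that Pattern.matcher accepts any CharSequence, including a StringBuilder.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the Norconex Importer API.
final class InPlaceScan {

    // One or more Unicode letters in a row counts as a word in this sketch.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    /** Counts words directly on the StringBuilder, without a full String copy. */
    static int countWords(StringBuilder section) {
        Matcher matcher = WORD.matcher(section); // works on the CharSequence as-is
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }
}
```

The same idea applies to other read-only operations such as indexOf or charAt, which are available on StringBuilder itself.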

    XML configuration usage:

    
    maxReadSize="(max characters to read at once)"
    sourceCharset="(character encoding)"

    Subclasses inherit the above IXMLConfigurable attribute(s), in addition to <restrictTo>.

    Author:
    Pascal Essiembre