Class AbstractStringFilter
- java.lang.Object
- com.norconex.importer.handler.AbstractImporterHandler
- com.norconex.importer.handler.filter.AbstractDocumentFilter
- com.norconex.importer.handler.filter.AbstractCharStreamFilter
- com.norconex.importer.handler.filter.AbstractStringFilter
- All Implemented Interfaces: IXMLConfigurable, IDocumentFilter, IOnMatchFilter, IImporterHandler
- Direct Known Subclasses: RegexContentFilter, ScriptFilter, TextFilter
public abstract class AbstractStringFilter extends AbstractCharStreamFilter
Base class to facilitate creating filters based on text content. Text is loaded into a StringBuilder for in-memory processing.
Since 2.2.0, this class limits the memory used for content filtering by reading one section of text at a time. Each section is sent for filtering as soon as it is read, until a match is found. No two sections exist in memory at once, and subclasses should respect this approach. Each section has a maximum number of characters equal to the maximum read size set with setMaxReadSize(int). When none is set, the default read size is TextReader.DEFAULT_MAX_READ_SIZE.
An attempt is made to break sections cleanly after a paragraph, sentence, or word. When that is not possible, long text is cut at the maximum read size.
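The section-breaking behaviour described above can be illustrated with a small, self-contained sketch. It is not the library's implementation: SectionSplitter, split, and findBreak are hypothetical names, the sketch works on an in-memory String rather than a stream, and the rules used to detect paragraph, sentence, and word boundaries are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of the Norconex API. */
public class SectionSplitter {

    /** Splits text into sections of at most maxReadSize characters. */
    public static List<String> split(String text, int maxReadSize) {
        if (maxReadSize <= 0) {
            throw new IllegalArgumentException("maxReadSize must be positive");
        }
        List<String> sections = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxReadSize, text.length());
            // Only look for a clean break when more text follows.
            if (end < text.length()) {
                end = findBreak(text, start, end);
            }
            sections.add(text.substring(start, end));
            start = end;
        }
        return sections;
    }

    /** Returns the exclusive end index of the best break in (start, limit]. */
    private static int findBreak(String text, int start, int limit) {
        // 1. Paragraph: assumed here to be a blank line ("\n\n").
        int para = text.lastIndexOf("\n\n", limit - 2);
        if (para >= start) {
            return para + 2;
        }
        // 2. Sentence: assumed to be '.', '!' or '?' followed by whitespace.
        for (int i = limit - 1; i > start; i--) {
            char prev = text.charAt(i - 1);
            if ((prev == '.' || prev == '!' || prev == '?')
                    && Character.isWhitespace(text.charAt(i))) {
                return i + 1;
            }
        }
        // 3. Word: break after the last whitespace character.
        for (int i = limit - 1; i >= start; i--) {
            if (Character.isWhitespace(text.charAt(i))) {
                return i + 1;
            }
        }
        // 4. No clean break found: cut at the maximum read size.
        return limit;
    }
}
```

For example, splitting "Hello world. This is a test." with a maximum of 15 characters breaks after the first sentence, while "abcdefghij" with a maximum of 4 has no clean break and is cut at exactly 4 characters per section.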
Since 3.0.0, the isStringContentMatching(HandlerDoc, StringBuilder, ParseState, int) method is invoked at least once, even if there is no content. This gives subclasses a chance to act on metadata alone.
Implementors should be mindful of memory usage when working with the StringBuilder.
XML configuration usage:
maxReadSize="(max characters to read at once)" sourceCharset="(character encoding)" onMatch="[include|exclude]"
Subclasses inherit the above IXMLConfigurable attribute(s), in addition to <restrictTo>.
- Author: Pascal Essiembre
-
Constructor Summary
Constructors
- AbstractStringFilter()
-
Method Summary
Modifier and type, method, and description:
- boolean equals(Object other)
- int getMaxReadSize(): Gets the maximum number of characters to read for filtering at once.
- int hashCode()
- protected abstract boolean isStringContentMatching(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex)
- protected boolean isTextDocumentMatching(HandlerDoc doc, Reader input, ParseState parseState)
- protected void loadCharStreamFilterFromXML(XML xml): Loads configuration settings specific to the implementing class.
- protected abstract void loadStringFilterFromXML(XML xml): Loads configuration settings specific to the implementing class.
- protected void saveCharStreamFilterToXML(XML xml): Saves configuration settings specific to the implementing class.
- protected abstract void saveStringFilterToXML(XML xml): Saves configuration settings specific to the implementing class.
- void setMaxReadSize(int maxReadSize): Sets the maximum number of characters to read for filtering at once.
- String toString()
-
Methods inherited from class com.norconex.importer.handler.filter.AbstractCharStreamFilter
getSourceCharset, isDocumentMatched, loadFilterFromXML, saveFilterToXML, setSourceCharset
-
Methods inherited from class com.norconex.importer.handler.filter.AbstractDocumentFilter
acceptDocument, getOnMatch, loadHandlerFromXML, saveHandlerToXML, setOnMatch
-
Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
-
Method Detail
-
isTextDocumentMatching
protected final boolean isTextDocumentMatching(HandlerDoc doc, Reader input, ParseState parseState) throws ImporterHandlerException
- Specified by: isTextDocumentMatching in class AbstractCharStreamFilter
- Throws: ImporterHandlerException
-
getMaxReadSize
public int getMaxReadSize()
Gets the maximum number of characters to read for filtering at once. Default is TextReader.DEFAULT_MAX_READ_SIZE.
- Returns: maximum read size
-
setMaxReadSize
public void setMaxReadSize(int maxReadSize)
Sets the maximum number of characters to read for filtering at once.
- Parameters: maxReadSize - maximum read size
-
isStringContentMatching
protected abstract boolean isStringContentMatching(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) throws ImporterHandlerException
- Throws: ImporterHandlerException
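As the class description notes, this method is called once per section until a match is found, and at least once even when there is no content. The following sketch models that calling contract with JDK types only. SectionFilterSketch, SectionMatcher, and isTextMatching are hypothetical stand-ins, not part of the Norconex API, and the sketch cuts sections at a fixed size rather than at paragraph, sentence, or word boundaries.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical model of the per-section calling contract. */
public class SectionFilterSketch {

    /** Stand-in for the isStringContentMatching callback. */
    public interface SectionMatcher {
        boolean matches(StringBuilder content, int sectionIndex);
    }

    /** Feeds one section at a time to the matcher; stops on first match. */
    public static boolean isTextMatching(
            Reader input, int maxReadSize, SectionMatcher matcher) {
        if (maxReadSize <= 0) {
            throw new IllegalArgumentException("maxReadSize must be positive");
        }
        char[] buffer = new char[maxReadSize];
        int sectionIndex = 0;
        int read;
        while ((read = fill(input, buffer)) > 0) {
            // Only the current section is held in memory.
            StringBuilder content =
                    new StringBuilder(read).append(buffer, 0, read);
            if (matcher.matches(content, sectionIndex)) {
                return true;
            }
            sectionIndex++;
        }
        // No content at all: still invoke the matcher once, so that it
        // can decide based on something other than content (e.g. metadata).
        if (sectionIndex == 0) {
            return matcher.matches(new StringBuilder(), 0);
        }
        return false;
    }

    /** Reads until the buffer is full or the stream ends. */
    private static int fill(Reader in, char[] buf) {
        try {
            int total = 0;
            while (total < buf.length) {
                int n = in.read(buf, total, buf.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With an empty reader, the matcher is still called exactly once, receiving an empty StringBuilder and section index 0. With non-empty input, later sections are never read once a section matches.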
-
saveCharStreamFilterToXML
protected final void saveCharStreamFilterToXML(XML xml)
Description copied from class: AbstractCharStreamFilter
Saves configuration settings specific to the implementing class. The parent tag, along with its "class" attribute, is already written. Implementors must not close the writer.
- Specified by: saveCharStreamFilterToXML in class AbstractCharStreamFilter
- Parameters: xml - the XML
-
saveStringFilterToXML
protected abstract void saveStringFilterToXML(XML xml)
Saves configuration settings specific to the implementing class. The parent tag, along with its "class" attribute, is already written. Implementors must not close the writer.
- Parameters: xml - the XML
-
loadCharStreamFilterFromXML
protected final void loadCharStreamFilterFromXML(XML xml)
Description copied from class: AbstractCharStreamFilter
Loads configuration settings specific to the implementing class.
- Specified by: loadCharStreamFilterFromXML in class AbstractCharStreamFilter
- Parameters: xml - XML configuration
-
loadStringFilterFromXML
protected abstract void loadStringFilterFromXML(XML xml)
Loads configuration settings specific to the implementing class.
- Parameters: xml - XML configuration
-
equals
public boolean equals(Object other)
- Overrides: equals in class AbstractCharStreamFilter
-
hashCode
public int hashCode()
- Overrides: hashCode in class AbstractCharStreamFilter
-
toString
public String toString()
- Overrides: toString in class AbstractCharStreamFilter
-