public class RegexContentFilter extends AbstractStringFilter
Filters a document based on whether a regular expression matches its content.
Depending on document size, the content may be matched in chunks, so an
expression expected to match across a chunk boundary may not produce the
expected result. Consider using AbstractCharStreamFilter if this is a concern.
Refer to AbstractDocumentFilter for the inclusion/exclusion logic.
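The chunking caveat can be illustrated with plain java.util.regex. The sketch below is not the filter's actual implementation: the class ChunkedMatchSketch, its methods, and the fixed-size splitting are assumptions made for illustration, and the real filter may split content differently. It only shows why an expression that matches the full content can fail once the content is read a limited number of characters at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Norconex Importer API.
final class ChunkedMatchSketch {

    // Splits the text into consecutive chunks of at most maxReadSize characters.
    static List<String> chunk(String text, int maxReadSize) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < text.length(); i += maxReadSize) {
            chunks.add(text.substring(i, Math.min(text.length(), i + maxReadSize)));
        }
        return chunks;
    }

    // Returns true if the pattern is found within at least one single chunk.
    static boolean anyChunkMatches(Pattern pattern, String text, int maxReadSize) {
        for (String c : chunk(text, maxReadSize)) {
            if (pattern.matcher(c).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("apple pie");
        String text = "I baked an apple pie today";

        // Matching against the whole text versus matching chunk by chunk.
        System.out.println("whole text: " + pattern.matcher(text).find());
        System.out.println("chunks of 16: " + anyChunkMatches(pattern, text, 16));
        System.out.println("chunks of 100: " + anyChunkMatches(pattern, text, 100));
    }
}
```

With a read size of 16, the sample text is split between "apple" and " pie", so no single chunk contains the phrase. A read size larger than the text keeps it in one chunk and the match is found again.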
Since 2.2.0, the following regular expression flags are always
active: Pattern.MULTILINE and Pattern.DOTALL.
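To make the effect of these two flags concrete, here is a minimal sketch using only java.util.regex. The class RegexFlagsSketch and its compile helper are hypothetical, and mapping caseSensitive="false" to Pattern.CASE_INSENSITIVE is an assumption about how the option is applied, not something this page confirms.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Norconex Importer API.
final class RegexFlagsSketch {

    // Compiles the expression with MULTILINE and DOTALL always on.
    // Assumption: a case-insensitive filter adds Pattern.CASE_INSENSITIVE.
    static Pattern compile(String regex, boolean caseSensitive) {
        int flags = Pattern.MULTILINE | Pattern.DOTALL;
        if (!caseSensitive) {
            flags |= Pattern.CASE_INSENSITIVE;
        }
        return Pattern.compile(regex, flags);
    }

    public static void main(String[] args) {
        String content = "first line\napple\nlast line";

        // DOTALL lets "." cross line terminators, so ".*apple.*" can cover
        // the whole multi-line content.
        System.out.println(Pattern.compile(".*apple.*").matcher(content).matches());
        System.out.println(compile(".*apple.*", true).matcher(content).matches());

        // MULTILINE lets "^" and "$" anchor at each line, not only at the
        // start and end of the whole content.
        System.out.println(Pattern.compile("^apple$").matcher(content).find());
        System.out.println(compile("^apple$", true).matcher(content).find());
    }
}
```

In each pair of calls, the first uses no flags and the second uses the flags listed above, which makes the difference in behavior easy to compare side by side.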
<filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter"
        onMatch="[include|exclude]"
        caseSensitive="[false|true]"
        sourceCharset="(character encoding)"
        maxReadSize="(max characters to read at once)" >
    <restrictTo caseSensitive="[false|true]"
            field="(name of header/metadata field name to match)">
        (regular expression of value to match)
    </restrictTo>
    <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
    <regex>(regular expression of value to match)</regex>
</filter>
This example will accept only documents containing the word "apple".
<filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter"
        onMatch="include" >
    <regex>.*apple.*</regex>
</filter>
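The decision made by the "apple" example can be approximated with the standard library alone. This is a sketch under stated assumptions, not the filter's code: the class IncludeFilterSketch and its local OnMatch enum are hypothetical stand-ins, the expression is assumed to be applied to the whole content (which the `.*apple.*` form suggests), matching is assumed to be case-insensitive when caseSensitive is left unset, and a single filter is assumed to accept a document when the match result agrees with onMatch. AbstractDocumentFilter remains the reference for the actual inclusion/exclusion logic.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Norconex Importer API.
final class IncludeFilterSketch {

    // Local stand-in for the onMatch="[include|exclude]" attribute.
    enum OnMatch { INCLUDE, EXCLUDE }

    // Returns true if a document with this content would be accepted,
    // under the assumptions described in the lead-in.
    static boolean accept(String regex, OnMatch onMatch, String content) {
        Pattern pattern = Pattern.compile(regex,
                Pattern.MULTILINE | Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        // Assumption: the expression must match the entire content.
        boolean matched = pattern.matcher(content).matches();
        return onMatch == OnMatch.INCLUDE ? matched : !matched;
    }

    public static void main(String[] args) {
        String withApple = "An Apple a day\nkeeps the doctor away";
        String withoutApple = "Only pears here";

        System.out.println(accept(".*apple.*", OnMatch.INCLUDE, withApple));
        System.out.println(accept(".*apple.*", OnMatch.INCLUDE, withoutApple));
        System.out.println(accept(".*apple.*", OnMatch.EXCLUDE, withApple));
    }
}
```

Switching the second argument between INCLUDE and EXCLUDE inverts the outcome for the same content, which mirrors the onMatch attribute in the XML configuration.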
Constructor and Description |
---|
RegexContentFilter() |
RegexContentFilter(String regex) |
RegexContentFilter(String regex, OnMatch onMatch) |
RegexContentFilter(String regex, OnMatch onMatch, boolean caseSensitive) |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
String | getRegex() |
int | hashCode() |
boolean | isCaseSensitive() |
protected boolean | isStringContentMatching(String reference, StringBuilder content, ImporterMetadata metadata, boolean parsed, int sectionIndex) |
protected void | loadStringFilterFromXML(org.apache.commons.configuration.XMLConfiguration xml) Loads configuration settings specific to the implementing class. |
protected void | saveStringFilterToXML(EnhancedXMLStreamWriter writer) Saves configuration settings specific to the implementing class. |
void | setCaseSensitive(boolean caseSensitive) |
void | setRegex(String regex) |
String | toString() |
getMaxReadSize, isTextDocumentMatching, loadCharStreamFilterFromXML, saveCharStreamFilterToXML, setMaxReadSize
getSourceCharset, isDocumentMatched, loadFilterFromXML, saveFilterToXML, setSourceCharset
acceptDocument, getOnMatch, loadHandlerFromXML, saveHandlerToXML, setOnMatch
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
public RegexContentFilter()
public RegexContentFilter(String regex)
public String getRegex()
public final void setRegex(String regex)
public boolean isCaseSensitive()
public void setCaseSensitive(boolean caseSensitive)
protected boolean isStringContentMatching(String reference, StringBuilder content, ImporterMetadata metadata, boolean parsed, int sectionIndex) throws ImporterHandlerException
isStringContentMatching in class AbstractStringFilter
Throws: ImporterHandlerException
protected void saveStringFilterToXML(EnhancedXMLStreamWriter writer) throws XMLStreamException
Saves configuration settings specific to the implementing class.
saveStringFilterToXML in class AbstractStringFilter
Parameters: writer - the xml writer
Throws: XMLStreamException - could not save to XML
protected void loadStringFilterFromXML(org.apache.commons.configuration.XMLConfiguration xml) throws IOException
Loads configuration settings specific to the implementing class.
loadStringFilterFromXML in class AbstractStringFilter
Parameters: xml - xml configuration
Throws: IOException - could not load from XML
public boolean equals(Object other)
equals in class AbstractStringFilter
public int hashCode()
hashCode in class AbstractStringFilter
public String toString()
toString in class AbstractStringFilter
Copyright © 2009–2021 Norconex Inc. All rights reserved.