public class TextBetweenTagger extends AbstractStringTagger implements IXMLConfigurable
Extracts values found between matching start and end strings and adds them to a document metadata field. The matching end-points are defined in pairs, and multiple pairs can be specified at once. The field specified for a pair of end-points is treated as a multi-value field.
This class can be used as a pre-parsing handler (on text documents only) or as a post-parsing handler.
<tagger class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger"
    inclusive="[false|true]"
    caseSensitive="[false|true]"
    sourceCharset="(character encoding)"
    maxReadSize="(max characters to read at once)" >
  <restrictTo caseSensitive="[false|true]"
      field="(name of header/metadata field name to match)">
    (regular expression of value to match)
  </restrictTo>
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <textBetween name="targetFieldName">
    <start>(regex)</start>
    <end>(regex)</end>
  </textBetween>
  <!-- multiple textBetween tags allowed -->
</tagger>
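The "only one needs to match" rule for `restrictTo` in the configuration above can be pictured with a small stand-alone sketch. It is not the Norconex implementation: `RestrictToSketch`, `Restriction`, and the `Map`-based metadata are hypothetical names introduced for illustration. It also assumes that the regular expression must match a whole field value and that a handler with no restrictions applies to every document.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class RestrictToSketch {

    /** Hypothetical restriction: a field name plus a regex for its values. */
    static final class Restriction {
        final String field;
        final Pattern pattern;

        Restriction(String field, String regex, boolean caseSensitive) {
            this.field = field;
            this.pattern = Pattern.compile(
                    regex, caseSensitive ? 0 : Pattern.CASE_INSENSITIVE);
        }

        /** True if any value of the field matches the regex entirely (assumption). */
        boolean matches(Map<String, List<String>> metadata) {
            List<String> values =
                    metadata.getOrDefault(field, Collections.emptyList());
            for (String value : values) {
                if (pattern.matcher(value).matches()) {
                    return true;
                }
            }
            return false;
        }
    }

    /** A document is applicable as soon as one restriction matches. */
    static boolean isApplicable(
            Map<String, List<String>> metadata, List<Restriction> restrictions) {
        if (restrictions.isEmpty()) {
            return true; // assumption: no restriction means "always apply"
        }
        for (Restriction restriction : restrictions) {
            if (restriction.matches(metadata)) {
                return true;
            }
        }
        return false;
    }
}
```

In this sketch the restrictions are combined with a logical OR, which mirrors the comment in the configuration template; any other detail should be checked against the library itself.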
The following example extracts the content between "OPEN" and "CLOSE" strings, excluding these strings, and stores it in a "content" field.
<tagger class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger" >
  <textBetween name="content">
    <start>OPEN</start>
    <end>CLOSE</end>
  </textBetween>
</tagger>
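To make the effect of the example above concrete, here is a minimal, self-contained sketch of the same extraction written with plain `java.util.regex`. It does not use the Norconex classes: `TextBetweenSketch`, `tag`, `values`, and the `Map`-backed metadata store are hypothetical names introduced for illustration, and the real tagger may differ in details such as trimming or how overlapping matches are handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TextBetweenSketch {

    /** Hypothetical stand-in for a multi-valued metadata store. */
    private final Map<String, List<String>> metadata = new LinkedHashMap<>();

    /**
     * Adds every text found between a match of startRegex and the next
     * match of endRegex to the given field, end-points excluded.
     */
    void tag(String content, String field, String startRegex, String endRegex) {
        Matcher start = Pattern.compile(startRegex).matcher(content);
        Matcher end = Pattern.compile(endRegex).matcher(content);
        int pos = 0;
        while (pos <= content.length()
                && start.find(pos) && end.find(start.end())) {
            // Each hit is appended, so the field accumulates multiple values.
            metadata.computeIfAbsent(field, k -> new ArrayList<>())
                    .add(content.substring(start.end(), end.start()));
            // Always move forward, even if both patterns match empty text.
            pos = Math.max(end.end(), pos + 1);
        }
    }

    /** Returns the values stored for a field, or an empty list. */
    List<String> values(String field) {
        return metadata.getOrDefault(field, new ArrayList<>());
    }

    public static void main(String[] args) {
        TextBetweenSketch sketch = new TextBetweenSketch();
        sketch.tag("aaa OPEN first CLOSE bbb OPEN second CLOSE ccc",
                "content", "OPEN", "CLOSE");
        System.out.println(sketch.values("content"));
    }
}
```

With the sample text in `main`, the "content" field ends up holding two values, one per OPEN/CLOSE pair, which is what a multi-value field means here.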
Constructor and Description |
---|
TextBetweenTagger() |
Modifier and Type | Method and Description |
---|---|
void | addTextEndpoints(String name, String fromText, String toText): Adds a new pair of end points to match. |
boolean | equals(Object other) |
int | hashCode() |
boolean | isCaseSensitive() |
boolean | isInclusive() |
protected void | loadStringTaggerFromXML(org.apache.commons.configuration.XMLConfiguration xml): Loads configuration settings specific to the implementing class. |
protected void | saveStringTaggerToXML(EnhancedXMLStreamWriter writer): Saves configuration settings specific to the implementing class. |
void | setCaseSensitive(boolean caseSensitive): Sets whether to ignore case when matching start and end text. |
void | setInclusive(boolean inclusive): Sets whether start and end text pairs should be kept or not. |
protected void | tagStringContent(String reference, StringBuilder content, ImporterMetadata metadata, boolean parsed, int sectionIndex) |
String | toString() |
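Methods inherited from class AbstractStringTagger: getMaxReadSize, loadCharStreamTaggerFromXML, saveCharStreamTaggerToXML, setMaxReadSize, tagTextDocument
Methods inherited from class AbstractCharStreamTagger: getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
Methods inherited from class AbstractDocumentTagger: tagDocument
Methods inherited from class AbstractImporterHandler: addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
Methods inherited from class java.lang.Object: clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Methods inherited from interface IXMLConfigurable: loadFromXML, saveToXML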
protected void tagStringContent(String reference, StringBuilder content, ImporterMetadata metadata, boolean parsed, int sectionIndex)
Specified by: tagStringContent in class AbstractStringTagger

public boolean isInclusive()

public void setInclusive(boolean inclusive)
Parameters:
inclusive - true to keep matching start and end text

public boolean isCaseSensitive()

public void setCaseSensitive(boolean caseSensitive)
Parameters:
caseSensitive - true to consider character case

public void addTextEndpoints(String name, String fromText, String toText)
Parameters:
name - target metadata field name where to store the extracted values
fromText - the left string to match
toText - the right string to match

protected void loadStringTaggerFromXML(org.apache.commons.configuration.XMLConfiguration xml) throws IOException
Description copied from class: AbstractStringTagger
Specified by: loadStringTaggerFromXML in class AbstractStringTagger
Parameters:
xml - xml configuration
Throws:
IOException - could not load from XML

protected void saveStringTaggerToXML(EnhancedXMLStreamWriter writer) throws XMLStreamException
Description copied from class: AbstractStringTagger
Specified by: saveStringTaggerToXML in class AbstractStringTagger
Parameters:
writer - the xml writer
Throws:
XMLStreamException - could not save to XML

public boolean equals(Object other)
Overrides: equals in class AbstractStringTagger

public int hashCode()
Overrides: hashCode in class AbstractStringTagger

public String toString()
Overrides: toString in class AbstractStringTagger
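The `inclusive` and `caseSensitive` settings documented above can be illustrated with a short stand-alone sketch. It relies only on `java.util.regex`. `EndpointOptionsSketch` and its `extract` method are hypothetical names, not part of the library, and mapping `caseSensitive=false` to `Pattern.CASE_INSENSITIVE` is an assumption about how such a flag could be applied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EndpointOptionsSketch {

    /**
     * Extracts text between start and end matches.
     * inclusive: keep the matched start and end text in each value.
     * caseSensitive: consider character case when matching end-points.
     */
    static List<String> extract(String content, String startRegex,
            String endRegex, boolean inclusive, boolean caseSensitive) {
        // Assumption: case-insensitivity is applied as a regex flag.
        int flags = caseSensitive ? 0 : Pattern.CASE_INSENSITIVE;
        Matcher start = Pattern.compile(startRegex, flags).matcher(content);
        Matcher end = Pattern.compile(endRegex, flags).matcher(content);
        List<String> values = new ArrayList<>();
        int pos = 0;
        while (pos <= content.length()
                && start.find(pos) && end.find(start.end())) {
            int from = inclusive ? start.start() : start.end();
            int to = inclusive ? end.end() : end.start();
            values.add(content.substring(from, to));
            // Always move forward, even if both patterns match empty text.
            pos = Math.max(end.end(), pos + 1);
        }
        return values;
    }

    public static void main(String[] args) {
        String text = "x open A Close y";
        System.out.println(extract(text, "OPEN", "CLOSE", true, false));
        System.out.println(extract(text, "OPEN", "CLOSE", false, false));
        System.out.println(extract(text, "OPEN", "CLOSE", false, true));
    }
}
```

In the sample, the first call keeps the end-points in the value, the second drops them, and the third finds nothing because the upper-case patterns no longer match the mixed-case text.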
Copyright © 2009–2021 Norconex Inc. All rights reserved.