public class TextBetweenTagger extends AbstractStringTagger implements IXMLConfigurable
Extracts values found between matching start and end strings and adds them to a document metadata field. The matching end-points are defined in pairs, and multiple pairs can be specified at once. The field specified for a pair of end-points is treated as a multi-value field.
If "fieldMatcher" is specified, text is extracted from the matching fields and all extracted values are stored in the target field as multiple values. Otherwise, the document content is used.
If a target field with the same name already exists for a document, values will be added to the end of the existing value list. It is possible to change this default behavior by supplying a PropertySetter.
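The following plain-Java sketch illustrates how the four "onSet" values listed in the configuration could combine newly extracted values with an existing multi-value field. It does not use the Norconex PropertySetter class: the OnSetSketch class, its OnSet enum, and the apply method are hypothetical names introduced for illustration only. Only the append behavior is described on this page; the behavior shown for prepend, replace, and optional is an assumption inferred from the option names.

```java
import java.util.ArrayList;
import java.util.List;

public class OnSetSketch {

    // Hypothetical enum mirroring the documented onSet values.
    enum OnSet { APPEND, PREPEND, REPLACE, OPTIONAL }

    // Returns the field values that result from applying a strategy.
    // Only APPEND is described by the documentation; the other three
    // branches are assumptions inferred from the option names.
    static List<String> apply(
            OnSet onSet, List<String> existing, List<String> extracted) {
        List<String> result = new ArrayList<>(existing);
        switch (onSet) {
            case APPEND:
                // Default: add to the end of the existing value list.
                result.addAll(extracted);
                break;
            case PREPEND:
                // Assumed: insert before the existing values.
                result.addAll(0, extracted);
                break;
            case REPLACE:
                // Assumed: discard the existing values.
                result.clear();
                result.addAll(extracted);
                break;
            case OPTIONAL:
                // Assumed: set only when the field has no value yet.
                if (result.isEmpty()) {
                    result.addAll(extracted);
                }
                break;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> existing = List.of("a");
        List<String> extracted = List.of("b", "c");
        System.out.println(apply(OnSet.APPEND, existing, extracted));   // [a, b, c]
        System.out.println(apply(OnSet.PREPEND, existing, extracted));  // [b, c, a]
        System.out.println(apply(OnSet.REPLACE, existing, extracted));  // [b, c]
        System.out.println(apply(OnSet.OPTIONAL, existing, extracted)); // [a]
    }
}
```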
This class can be used as a pre-parsing handler (on text documents only) or as a post-parsing handler.
<handler
    class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger"
    maxReadSize="(max characters to read at once)"
    sourceCharset="(character encoding)">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  <!-- multiple "textBetween" tags allowed -->
  <textBetween
      toField="(target field name)"
      inclusive="[false|true]"
      onSet="[append|prepend|replace|optional]">
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (optional expression matching fields to perform extraction on)
    </fieldMatcher>
    <startMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (expression matching "left" delimiter)
    </startMatcher>
    <endMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (expression matching "right" delimiter)
    </endMatcher>
  </textBetween>
</handler>
<handler
    class="TextBetweenTagger">
  <textBetween
      toField="content">
    <startMatcher>OPEN</startMatcher>
    <endMatcher>CLOSE</endMatcher>
  </textBetween>
</handler>
The above example extracts the content between the "OPEN" and "CLOSE" strings, excluding these strings, and stores it in a "content" field.
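To make the effect of the "inclusive" attribute concrete, here is a minimal plain-Java sketch of extracting text between literal delimiters. It relies only on java.util.regex and is not how the tagger itself is implemented; the TextBetweenSketch class and extractBetween method are hypothetical names, and details such as matcher methods, case handling, and whitespace trimming are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextBetweenSketch {

    // Hypothetical helper, not part of the Norconex API.
    // Collects every value found between a start and an end delimiter.
    static List<String> extractBetween(
            String text, String start, String end, boolean inclusive) {
        // Quote the delimiters so they are matched literally. The reluctant
        // group stops at the first end delimiter following each start.
        Pattern pattern = Pattern.compile(
                Pattern.quote(start) + "(.*?)" + Pattern.quote(end),
                Pattern.DOTALL);
        List<String> values = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            // inclusive keeps the delimiters; otherwise keep only the inside.
            values.add(inclusive ? matcher.group() : matcher.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String content = "aOPENfirstCLOSEbOPENsecondCLOSEc";
        // Each match becomes one value of a multi-value field.
        System.out.println(extractBetween(content, "OPEN", "CLOSE", false));
        // prints [first, second]
        System.out.println(extractBetween(content, "OPEN", "CLOSE", true));
        // prints [OPENfirstCLOSE, OPENsecondCLOSE]
    }
}
```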
Modifier and Type | Class and Description |
---|---|
static class | TextBetweenTagger.TextBetweenDetails |
Constructor and Description |
---|
TextBetweenTagger() |
Modifier and Type | Method and Description |
---|---|
void | addTextBetweenDetails(TextBetweenTagger.TextBetweenDetails details) Adds text between instructions. |
void | addTextEndpoints(String toField, String fromText, String toText) Deprecated. Since 3.0.0, use addTextBetweenDetails(TextBetweenDetails) |
boolean | equals(Object other) |
List<TextBetweenTagger.TextBetweenDetails> | getTextBetweenDetailsList() Gets text between instructions. |
int | hashCode() |
boolean | isCaseSensitive() Deprecated. Since 3.0.0, use TextBetweenTagger.TextBetweenDetails.isCaseSensitive() |
boolean | isInclusive() Deprecated. Since 3.0.0, use TextBetweenTagger.TextBetweenDetails.isInclusive() |
protected void | loadStringTaggerFromXML(XML xml) Loads configuration settings specific to the implementing class. |
protected void | saveStringTaggerToXML(XML xml) Saves configuration settings specific to the implementing class. |
void | setCaseSensitive(boolean caseSensitive) Deprecated. Since 3.0.0, use TextBetweenTagger.TextBetweenDetails.setCaseSensitive(boolean) |
void | setInclusive(boolean inclusive) Deprecated. Since 3.0.0, use TextBetweenTagger.TextBetweenDetails.setInclusive(boolean) |
protected void | tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) |
String | toString() |
getMaxReadSize, loadCharStreamTaggerFromXML, saveCharStreamTaggerToXML, setMaxReadSize, tagTextDocument
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
tagDocument
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
loadFromXML, saveToXML
protected void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) throws ImporterHandlerException
Specified by: tagStringContent in class AbstractStringTagger
Throws: ImporterHandlerException
@Deprecated public boolean isInclusive()
Deprecated. Since 3.0.0, use TextBetweenTagger.TextBetweenDetails.isInclusive()
Returns: false
@Deprecated public void setInclusive(boolean inclusive)
Deprecated. Since 3.0.0, use TextBetweenTagger.TextBetweenDetails.setInclusive(boolean)
Parameters: inclusive - true to keep matching start and end text
@Deprecated public boolean isCaseSensitive()
Deprecated. Since 3.0.0, use TextBetweenTagger.TextBetweenDetails.isCaseSensitive()
Returns: false
@Deprecated public void setCaseSensitive(boolean caseSensitive)
Deprecated. Since 3.0.0, use TextBetweenTagger.TextBetweenDetails.setCaseSensitive(boolean)
Parameters: caseSensitive - true to consider character case
@Deprecated public void addTextEndpoints(String toField, String fromText, String toText)
Deprecated. Since 3.0.0, use addTextBetweenDetails(TextBetweenDetails)
Parameters:
toField - target metadata field name where to store the extracted values
fromText - the left string to match
toText - the right string to match
public void addTextBetweenDetails(TextBetweenTagger.TextBetweenDetails details)
Parameters: details - "text between" details
public List<TextBetweenTagger.TextBetweenDetails> getTextBetweenDetailsList()
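protected void loadStringTaggerFromXML(XML xml)
Description copied from class: AbstractStringTagger
Loads configuration settings specific to the implementing class.
Specified by: loadStringTaggerFromXML in class AbstractStringTagger
Parameters: xml - XML configuration
protected void saveStringTaggerToXML(XML xml)
Description copied from class: AbstractStringTagger
Saves configuration settings specific to the implementing class.
Specified by: saveStringTaggerToXML in class AbstractStringTagger
Parameters: xml - the XML
public boolean equals(Object other)
Overrides: equals in class AbstractStringTagger
public int hashCode()
Overrides: hashCode in class AbstractStringTagger
public String toString()
Overrides: toString in class AbstractStringTagger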
Copyright © 2009–2023 Norconex Inc. All rights reserved.