Class TextBetweenTagger
- java.lang.Object
  - com.norconex.importer.handler.AbstractImporterHandler
    - com.norconex.importer.handler.tagger.AbstractDocumentTagger
      - com.norconex.importer.handler.tagger.AbstractCharStreamTagger
        - com.norconex.importer.handler.tagger.AbstractStringTagger
          - com.norconex.importer.handler.tagger.impl.TextBetweenTagger
- All Implemented Interfaces:
IXMLConfigurable, IImporterHandler, IDocumentTagger
public class TextBetweenTagger extends AbstractStringTagger implements IXMLConfigurable
Extracts values found between matching start and end strings and adds them to a document metadata field. The matching end-points are defined in pairs, and several pairs can be specified at once. The field specified for a pair of end-points is treated as a multi-value field.
If "fieldMatcher" is specified, text is extracted from the values of the matching fields, and all extracted text is stored in the target field as multiple values. Otherwise, the document content is used.
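The choice of source text described above can be pictured with a small Java sketch. It is an illustration only, not the handler's code: the class SourceSelectionSketch, the method selectSources, and the use of a Predicate as a stand-in for the field-matching expression are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class SourceSelectionSketch {

    // Hypothetical helper: returns the texts to run extraction on.
    // A null fieldMatcher stands for "no fieldMatcher configured".
    public static List<String> selectSources(
            Map<String, List<String>> metadata,
            String content,
            Predicate<String> fieldMatcher) {
        List<String> sources = new ArrayList<>();
        if (fieldMatcher == null) {
            // No field matcher: fall back to the document content.
            sources.add(content);
            return sources;
        }
        // Field matcher given: use every value of every matching field.
        for (Map.Entry<String, List<String>> en : metadata.entrySet()) {
            if (fieldMatcher.test(en.getKey())) {
                sources.addAll(en.getValue());
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("title", List.of("OPEN a title CLOSE"));
        metadata.put("author", List.of("someone"));
        System.out.println(selectSources(metadata, "body text", null));
        System.out.println(
                selectSources(metadata, "body text", f -> f.equals("title")));
    }
}
```

In this sketch the document content is ignored as soon as a field matcher is present, which mirrors the "else" wording of the description.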
Storing values in an existing field
If a target field with the same name already exists for a document, values will be added to the end of the existing value list. This default behavior can be changed by supplying a PropertySetter.
This class can be used as a pre-parsing handler on text documents only, or as a post-parsing handler.
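As a rough picture of the default behavior for existing fields, the following Java sketch appends extracted values to the end of a multi-value field held in a plain map. The class AppendFieldSketch and the method appendValues are hypothetical names chosen for this example; the sketch does not use or model PropertySetter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AppendFieldSketch {

    // Hypothetical helper: adds the extracted values after any values
    // the target field already holds, creating the field if missing.
    public static void appendValues(
            Map<String, List<String>> metadata,
            String toField,
            List<String> extracted) {
        metadata.computeIfAbsent(toField, k -> new ArrayList<>())
                .addAll(extracted);
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("content", new ArrayList<>(List.of("existing")));
        appendValues(metadata, "content", List.of("first", "second"));
        System.out.println(metadata.get("content"));
    }
}
```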
XML configuration usage:
<handler class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger"
    maxReadSize="(max characters to read at once)"
    sourceCharset="(character encoding)">

  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher>(field-matching expression)</fieldMatcher>
    <valueMatcher>(value-matching expression)</valueMatcher>
  </restrictTo>

  <!-- multiple textBetween tags allowed -->
  <textBetween toField="(target field name)" inclusive="[false|true]">
    <fieldMatcher>
      (optional expression matching fields to perform extraction on)
    </fieldMatcher>
    <startMatcher>(expression matching "left" delimiter)</startMatcher>
    <endMatcher>(expression matching "right" delimiter)</endMatcher>
  </textBetween>
</handler>
XML usage example:
<handler class="TextBetweenTagger">
  <textBetween toField="content">
    <startMatcher>OPEN</startMatcher>
    <endMatcher>CLOSE</endMatcher>
  </textBetween>
</handler>
The above example extracts the content between the "OPEN" and "CLOSE" strings, excluding these strings, and stores it in a "content" field.
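To make the effect of the "inclusive" attribute concrete, here is a minimal Java sketch of the idea using literal delimiters. It is not the tagger's implementation: the class TextBetweenSketch and the method extractBetween are hypothetical, and the real handler matches with configurable matcher expressions rather than plain strings. Whether the handler trims extracted values is not stated here, so the sketch leaves whitespace untouched.

```java
import java.util.ArrayList;
import java.util.List;

public class TextBetweenSketch {

    // Hypothetical helper: collects every span found between a start
    // and an end delimiter, scanning left to right without overlap.
    public static List<String> extractBetween(
            String text, String start, String end, boolean inclusive) {
        List<String> values = new ArrayList<>();
        if (start.isEmpty() || end.isEmpty()) {
            return values; // nothing sensible to match
        }
        int from = 0;
        while (true) {
            int s = text.indexOf(start, from);
            if (s < 0) {
                break;
            }
            int e = text.indexOf(end, s + start.length());
            if (e < 0) {
                break;
            }
            // inclusive keeps the delimiters, otherwise they are dropped
            values.add(inclusive
                    ? text.substring(s, e + end.length())
                    : text.substring(s + start.length(), e));
            from = e + end.length();
        }
        return values;
    }

    public static void main(String[] args) {
        String content = "a OPEN one CLOSE b OPEN two CLOSE c";
        System.out.println(extractBetween(content, "OPEN", "CLOSE", false));
        System.out.println(extractBetween(content, "OPEN", "CLOSE", true));
    }
}
```

Each match becomes one value of the target field, which is why the field is treated as multi-value.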
- Author:
- Pascal Essiembre
-
Nested Class Summary
Nested Classes
Modifier and Type    Class
static class         TextBetweenTagger.TextBetweenDetails
-
Constructor Summary
Constructors
Constructor            Description
TextBetweenTagger()
-
Method Summary
All Methods
Modifier and Type, Method, Description
- void addTextBetweenDetails(TextBetweenTagger.TextBetweenDetails details): Adds text between instructions.
- void addTextEndpoints(String toField, String fromText, String toText): Deprecated. Since 3.0.0, use addTextBetweenDetails(TextBetweenDetails)
- boolean equals(Object other)
- List<TextBetweenTagger.TextBetweenDetails> getTextBetweenDetailsList(): Gets text between instructions.
- int hashCode()
- boolean isCaseSensitive(): Deprecated. Since 3.0.0, use TextBetweenTagger.TextBetweenDetails.isCaseSensitive()
- boolean isInclusive(): Deprecated. Since 3.0.0, use TextBetweenTagger.TextBetweenDetails.isInclusive()
- protected void loadStringTaggerFromXML(XML xml): Loads configuration settings specific to the implementing class.
- protected void saveStringTaggerToXML(XML xml): Saves configuration settings specific to the implementing class.
- void setCaseSensitive(boolean caseSensitive): Deprecated. Since 3.0.0, use TextBetweenTagger.TextBetweenDetails.setCaseSensitive(boolean)
- void setInclusive(boolean inclusive): Deprecated. Since 3.0.0, use TextBetweenTagger.TextBetweenDetails.setInclusive(boolean)
- protected void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex)
- String toString()
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractStringTagger
getMaxReadSize, loadCharStreamTaggerFromXML, saveCharStreamTaggerToXML, setMaxReadSize, tagTextDocument
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractCharStreamTagger
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractDocumentTagger
tagDocument
-
Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
-
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface com.norconex.commons.lang.xml.IXMLConfigurable
loadFromXML, saveToXML
-
Method Detail
-
tagStringContent
protected void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) throws ImporterHandlerException
- Specified by:
tagStringContent in class AbstractStringTagger
- Throws:
ImporterHandlerException
-
isInclusive
@Deprecated public boolean isInclusive()
Deprecated. Since 3.0.0, use TextBetweenTagger.TextBetweenDetails.isInclusive()
Gets whether start and end text pairs should be kept or not.
- Returns:
always false
-
setInclusive
@Deprecated public void setInclusive(boolean inclusive)
Deprecated. Since 3.0.0, use TextBetweenTagger.TextBetweenDetails.setInclusive(boolean)
Sets whether start and end text pairs should be kept or not. Calling this method has no effect.
- Parameters:
inclusive - true to keep matching start and end text
-
isCaseSensitive
@Deprecated public boolean isCaseSensitive()
Deprecated. Since 3.0.0, use TextBetweenTagger.TextBetweenDetails.isCaseSensitive()
Gets whether to ignore case when matching start and end text.
- Returns:
always false
-
setCaseSensitive
@Deprecated public void setCaseSensitive(boolean caseSensitive)
Deprecated. Since 3.0.0, use TextBetweenTagger.TextBetweenDetails.setCaseSensitive(boolean)
Sets whether to ignore case when matching start and end text. Calling this method has no effect.
- Parameters:
caseSensitive - true to consider character case
-
addTextEndpoints
@Deprecated public void addTextEndpoints(String toField, String fromText, String toText)
Deprecated. Since 3.0.0, use addTextBetweenDetails(TextBetweenDetails)
Adds a new pair of end points to match.
- Parameters:
toField - target metadata field name where to store the extracted values
fromText - the left string to match
toText - the right string to match
-
addTextBetweenDetails
public void addTextBetweenDetails(TextBetweenTagger.TextBetweenDetails details)
Adds text between instructions.
- Parameters:
details - "text between" details
-
getTextBetweenDetailsList
public List<TextBetweenTagger.TextBetweenDetails> getTextBetweenDetailsList()
Gets text between instructions.
- Returns:
"text between" details
- Since:
3.0.0
-
loadStringTaggerFromXML
protected void loadStringTaggerFromXML(XML xml)
Description copied from class: AbstractStringTagger
Loads configuration settings specific to the implementing class.
- Specified by:
loadStringTaggerFromXML in class AbstractStringTagger
- Parameters:
xml - XML configuration
-
saveStringTaggerToXML
protected void saveStringTaggerToXML(XML xml)
Description copied from class: AbstractStringTagger
Saves configuration settings specific to the implementing class. The parent tag along with the "class" attribute are already written. Implementors must not close the writer.
- Specified by:
saveStringTaggerToXML in class AbstractStringTagger
- Parameters:
xml - the XML
-
equals
public boolean equals(Object other)
- Overrides:
equals in class AbstractStringTagger
-
hashCode
public int hashCode()
- Overrides:
hashCode in class AbstractStringTagger
-
toString
public String toString()
- Overrides:
toString in class AbstractStringTagger
-