Class TextPatternTagger
- java.lang.Object
- com.norconex.importer.handler.AbstractImporterHandler
- com.norconex.importer.handler.tagger.AbstractDocumentTagger
- com.norconex.importer.handler.tagger.AbstractCharStreamTagger
- com.norconex.importer.handler.tagger.AbstractStringTagger
- com.norconex.importer.handler.tagger.impl.TextPatternTagger
- All Implemented Interfaces:
IXMLConfigurable, IImporterHandler, IDocumentTagger
@Deprecated public class TextPatternTagger extends AbstractStringTagger implements IXMLConfigurable
Deprecated. Since 3.0.0, use RegexTagger.
Extracts and adds all text values matching the provided regular expression to a field that is either provided explicitly or captured by a match group of the same regular expression. The target field is considered a multi-value field.
It is possible to extract both the field names and their values with regular expressions. This is done by using match groups (parentheses) in your regular expressions. For each pattern you define, you can specify which match group holds the field name and which one holds the value. Specifying a field match group is optional if a field is provided. If no match groups are specified, a field is expected.
Storing values in an existing field
If a target field with the same name already exists for a document, values will be added to the end of the existing value list. It is possible to change this default behavior by supplying a PropertySetter.
This class can be used as a pre-parsing handler on text documents only or a post-parsing handler.
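The storage rule described above can be illustrated with a minimal sketch. It models a document's metadata as a map from field name to value list and assumes the four strategies named by the onSet attribute (append, prepend, replace, optional). The OnSet enum and the store method are hypothetical stand-ins written for this illustration, not the library's PropertySetter. Treating "optional" as "set only when the field has no values yet" is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OnSetSketch {

    /** Hypothetical stand-in for the onSet options; not the library's PropertySetter. */
    enum OnSet { APPEND, PREPEND, REPLACE, OPTIONAL }

    /** Stores newly extracted values into a multi-value field using the given strategy. */
    static void store(Map<String, List<String>> metadata, String field,
            List<String> extracted, OnSet onSet) {
        List<String> values = metadata.computeIfAbsent(field, k -> new ArrayList<>());
        switch (onSet) {
            case APPEND:
                // Default behavior described above: add to the end of the list.
                values.addAll(extracted);
                break;
            case PREPEND:
                values.addAll(0, extracted);
                break;
            case REPLACE:
                values.clear();
                values.addAll(extracted);
                break;
            case OPTIONAL:
                // Assumption: only set when the field has no values yet.
                if (values.isEmpty()) {
                    values.addAll(extracted);
                }
                break;
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        store(metadata, "emails", Arrays.asList("a@example.com"), OnSet.APPEND);
        store(metadata, "emails", Arrays.asList("b@example.com"), OnSet.APPEND);
        System.out.println(metadata);
    }
}
```

With the default (append) strategy, values extracted later end up after the ones already present, which is why the target field behaves as a multi-value field.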
XML configuration usage:
<handler class="com.norconex.importer.handler.tagger.impl.TextPatternTagger"
    sourceCharset="(character encoding)"
    maxReadSize="(max characters to read at once)">

  <restrictTo caseSensitive="[false|true]"
      field="(name of header/metadata field name to match)">
    (regular expression of value to match)
  </restrictTo>
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->

  <pattern
      toField="(target field name)"
      fieldGroup="(field name match group index)"
      valueGroup="(field value match group index)"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      onSet="[append|prepend|replace|optional]">
    (regular expression)
  </pattern>
  <!-- multiple pattern tags allowed -->
</handler>
Usage example:
The first pattern in the following example extracts what look like email addresses into an "emails" field (simplified regex). The second pattern extracts field names and values from the "label" and "value" cells of a given HTML table:
XML usage example:
<handler class="TextPatternTagger">
  <pattern field="emails">
    [A-Za-z0-9+_.-]+?@[a-zA-Z0-9.-]+
  </pattern>
  <pattern fieldGroup="1" valueGroup="2">
    <![CDATA[
      <tr><td class="label">(.*?)</td><td class="value">(.*?)</td></tr>
    ]]>
  </pattern>
</handler>
- Since:
- 2.3.0
- Author:
- Pascal Essiembre
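To make the two patterns of the XML usage example more concrete, the following sketch applies the same regular expressions with plain java.util.regex. It only illustrates which text the expressions capture. The class name, the extract method, and the map used to hold results are hypothetical and are not part of the Norconex API, and the sample input is made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatternExampleSketch {

    // Same expressions as in the XML usage example, with surrounding whitespace trimmed.
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9+_.-]+?@[a-zA-Z0-9.-]+");
    static final Pattern ROW = Pattern.compile(
            "<tr><td class=\"label\">(.*?)</td><td class=\"value\">(.*?)</td></tr>");

    /** Returns extracted values keyed by field name (each field is multi-valued). */
    static Map<String, List<String>> extract(String content) {
        Map<String, List<String>> fields = new LinkedHashMap<>();

        // First pattern: fixed field name, whole match is the value.
        Matcher email = EMAIL.matcher(content);
        while (email.find()) {
            fields.computeIfAbsent("emails", k -> new ArrayList<>()).add(email.group());
        }

        // Second pattern: group 1 is the field name, group 2 is the value.
        Matcher row = ROW.matcher(content);
        while (row.find()) {
            fields.computeIfAbsent(row.group(1), k -> new ArrayList<>()).add(row.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "Contact: jane.doe@example.com "
                + "<tr><td class=\"label\">Color</td><td class=\"value\">Blue</td></tr>"
                + "<tr><td class=\"label\">Color</td><td class=\"value\">Red</td></tr>";
        System.out.println(extract(html));
    }
}
```

Because both table rows share the label "Color", the sketch collects "Blue" and "Red" under the same field, mirroring the multi-value behavior described for the target field.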
-
-
Constructor Summary
Constructor | Description
TextPatternTagger() | Deprecated.
-
Method Summary
Modifier and Type | Method | Description
void | addPattern(RegexFieldValueExtractor... pattern) | Deprecated. Adds one or more patterns that will extract matching field names/values.
void | addPattern(String field, String pattern) | Deprecated. Adds a pattern that will extract the whole text matched into the given field.
void | addPattern(String field, String pattern, int valueGroup) | Deprecated. Adds a new pattern, which will extract the value from the specified group index upon matching.
boolean | equals(Object other) | Deprecated.
List<RegexFieldValueExtractor> | getPatterns() | Deprecated. Gets the patterns used to extract matching field names/values.
int | hashCode() | Deprecated.
protected void | loadStringTaggerFromXML(XML xml) | Deprecated. Loads configuration settings specific to the implementing class.
protected void | saveStringTaggerToXML(XML xml) | Deprecated. Saves configuration settings specific to the implementing class.
void | setPattern(RegexFieldValueExtractor... patterns) | Deprecated. Sets one or more patterns that will extract matching field names/values.
protected void | tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) | Deprecated.
String | toString() | Deprecated.
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractStringTagger
getMaxReadSize, loadCharStreamTaggerFromXML, saveCharStreamTaggerToXML, setMaxReadSize, tagTextDocument
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractCharStreamTagger
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractDocumentTagger
tagDocument
-
Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
-
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface com.norconex.commons.lang.xml.IXMLConfigurable
loadFromXML, saveToXML
-
Method Detail
-
tagStringContent
protected void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) throws ImporterHandlerException
Deprecated.
- Specified by:
tagStringContent in class AbstractStringTagger
- Throws:
ImporterHandlerException
-
addPattern
public void addPattern(String field, String pattern)
Deprecated. Adds a pattern that will extract the whole text matched into the given field.
- Parameters:
field - target field in which to store the matched text.
pattern - the pattern
-
addPattern
public void addPattern(String field, String pattern, int valueGroup)
Deprecated. Adds a new pattern, which will extract the value from the specified group index upon matching.
- Parameters:
field - target field in which to store the matched text.
pattern - the pattern
valueGroup - which pattern group to return.
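The difference between storing the whole match and storing a value group can be shown with plain java.util.regex, where group 0 is the whole match and group N is the N-th parenthesized group. This is a sketch only: the class and method names are hypothetical, and equating the two-argument overload with "group 0" is an assumption based on its description ("the whole text matched").

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ValueGroupSketch {

    /** Collects, for every match of the regex, the text of the requested group. */
    static List<String> extract(String regex, String content, int valueGroup) {
        List<String> values = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(content);
        while (m.find()) {
            // Group 0 is the whole match; higher indexes are parenthesized groups.
            values.add(m.group(valueGroup));
        }
        return values;
    }

    public static void main(String[] args) {
        String content = "Invoice #1001 and Invoice #1002";
        String regex = "Invoice #(\\d+)";
        System.out.println(extract(regex, content, 0)); // whole matches
        System.out.println(extract(regex, content, 1)); // digits only
    }
}
```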
-
addPattern
public void addPattern(RegexFieldValueExtractor... pattern)
Deprecated. Adds one or more patterns that will extract matching field names/values.
- Parameters:
pattern - field extractor pattern
-
setPattern
public void setPattern(RegexFieldValueExtractor... patterns)
Deprecated. Sets one or more patterns that will extract matching field names/values. Clears previously set patterns.
- Parameters:
patterns - field extractor pattern
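The contrast between addPattern (accumulates) and setPattern (clears, then sets) can be sketched with a small stand-alone class. It is a hypothetical illustration that uses plain strings in place of RegexFieldValueExtractor and does not reproduce the library's implementation. Returning an unmodifiable view from getPatterns is a choice made for this sketch only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class PatternListSketch {

    // Plain strings stand in for RegexFieldValueExtractor instances.
    private final List<String> patterns = new ArrayList<>();

    /** Adds to whatever patterns are already registered. */
    void addPattern(String... toAdd) {
        patterns.addAll(Arrays.asList(toAdd));
    }

    /** Clears previously set patterns, then registers the given ones. */
    void setPattern(String... replacement) {
        patterns.clear();
        addPattern(replacement);
    }

    /** Read-only view of the registered patterns (a choice of this sketch). */
    List<String> getPatterns() {
        return Collections.unmodifiableList(patterns);
    }

    public static void main(String[] args) {
        PatternListSketch sketch = new PatternListSketch();
        sketch.addPattern("a+");
        sketch.addPattern("b+", "c+");
        System.out.println(sketch.getPatterns());
        sketch.setPattern("d+");
        System.out.println(sketch.getPatterns());
    }
}
```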
-
getPatterns
public List<RegexFieldValueExtractor> getPatterns()
Deprecated. Gets the patterns used to extract matching field names/values.
- Returns:
patterns
-
loadStringTaggerFromXML
protected void loadStringTaggerFromXML(XML xml)
Deprecated.
Description copied from class: AbstractStringTagger
Loads configuration settings specific to the implementing class.
- Specified by:
loadStringTaggerFromXML in class AbstractStringTagger
- Parameters:
xml - xml configuration
-
saveStringTaggerToXML
protected void saveStringTaggerToXML(XML xml)
Deprecated.
Description copied from class: AbstractStringTagger
Saves configuration settings specific to the implementing class. The parent tag along with the "class" attribute are already written. Implementors must not close the writer.
- Specified by:
saveStringTaggerToXML in class AbstractStringTagger
- Parameters:
xml - the XML
-
equals
public boolean equals(Object other)
Deprecated.
- Overrides:
equals in class AbstractStringTagger
-
hashCode
public int hashCode()
Deprecated.
- Overrides:
hashCode in class AbstractStringTagger
-
toString
public String toString()
Deprecated.
- Overrides:
toString in class AbstractStringTagger