public class TextPatternTagger extends AbstractStringTagger implements IXMLConfigurable
Extracts and adds all text values matching the provided regular expression to a field that is either provided explicitly or also matched by a regular expression. The target field is considered a multi-value field.
Since 2.8.0, it is possible to extract both the field names
and their values with regular expressions. This is done by using
match groups (parentheses) in your regular expressions. For each pattern
you define, you can specify which match group holds the field name and
which one holds the value.
Specifying a field match group is optional if a field
is provided. If no match groups are specified, a field
is expected.
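The group mechanics described above can be illustrated with plain java.util.regex. The following is a minimal sketch, not the TextPatternTagger implementation: the class name MatchGroupSketch, the extract method, and the use of -1 to mean "no group specified" are assumptions made for this illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: mimics the documented behavior with java.util.regex.
final class MatchGroupSketch {

    // fieldGroup/valueGroup of -1 mean "not specified" (a convention
    // assumed for this sketch). Without a field group, fixedField is required.
    static Map<String, List<String>> extract(
            String text, String regex, String fixedField,
            int fieldGroup, int valueGroup) {
        if (fixedField == null && fieldGroup < 0) {
            throw new IllegalArgumentException(
                    "A field is expected when no field group is specified.");
        }
        Map<String, List<String>> fields = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            // Field name comes from the match group if given, else the fixed field.
            String field = fieldGroup >= 0 ? m.group(fieldGroup) : fixedField;
            // Group 0 is the whole match, used when no value group is given.
            String value = m.group(valueGroup >= 0 ? valueGroup : 0);
            // Fields are multi-valued: repeated names accumulate values.
            fields.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = "color=red; size=large; color=blue";
        // Field name from group 1, value from group 2.
        System.out.println(extract(text, "(\\w+)=(\\w+)", null, 1, 2));
        // prints {color=[red, blue], size=[large]}
        // Fixed field name, value from group 1.
        System.out.println(extract(text, "\\w+=(\\w+)", "values", -1, 1));
        // prints {values=[red, large, blue]}
    }
}
```

The second call shows why a field group is optional when a field is provided: every match is stored under the same fixed name.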
Since 2.8.0, case sensitivity for regular expressions is set on each pattern.
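As a rough illustration of per-pattern case sensitivity, the sketch below compiles each regular expression with its own flag using java.util.regex. How the tagger maps its caseSensitive setting to regex flags internally is an assumption here; CaseSensitivitySketch and its methods are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: each pattern carries its own case-sensitivity choice.
final class CaseSensitivitySketch {

    // Compiles a regex, ignoring case unless caseSensitive is true.
    static Pattern compile(String regex, boolean caseSensitive) {
        int flags = caseSensitive ? 0 : Pattern.CASE_INSENSITIVE;
        return Pattern.compile(regex, flags);
    }

    // Counts non-overlapping matches of the pattern in the text.
    static int countMatches(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Invoice INVOICE invoice";
        System.out.println(countMatches(compile("invoice", true), text));  // prints 1
        System.out.println(countMatches(compile("invoice", false), text)); // prints 3
    }
}
```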
This class can be used as a pre-parsing handler on text documents only, or as a post-parsing handler.
<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger"
    sourceCharset="(character encoding)"
    maxReadSize="(max characters to read at once)" >

  <restrictTo caseSensitive="[false|true]"
      field="(name of header/metadata field name to match)">
    (regular expression of value to match)
  </restrictTo>
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->

  <pattern field="(target field name)"
      fieldGroup="(field name match group index)"
      valueGroup="(field value match group index)"
      caseSensitive="[false|true]">
    (regular expression)
  </pattern>
  <!-- multiple pattern tags allowed -->
</tagger>
The first pattern in the following example extracts what looks like email addresses into an "emails" field (simplified regex). The second pattern extracts field names and values from "label" and "value" cells in a given HTML table:
<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger" >
  <pattern field="emails">
    [A-Za-z0-9+_.-]+?@[a-zA-Z0-9.-]+
  </pattern>
  <pattern fieldGroup="1" valueGroup="2"><![CDATA[
    <tr><td class="label">(.*?)</td><td class="value">(.*?)</td></tr>
  ]]></pattern>
</tagger>
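To see what the two regular expressions in this example would match, they can be tried directly with java.util.regex. This is only a sketch of the matching step, not of the tagger itself: the sample inputs and the ExamplePatternsSketch class are made up for illustration, and it assumes the whitespace surrounding each expression in the XML is not part of the pattern.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: applies the example's regular expressions to sample text.
final class ExamplePatternsSketch {

    static final String EMAIL = "[A-Za-z0-9+_.-]+?@[a-zA-Z0-9.-]+";
    static final String ROW =
            "<tr><td class=\"label\">(.*?)</td><td class=\"value\">(.*?)</td></tr>";

    // Whole matches go to a single field (like field="emails").
    static List<String> emails(String text) {
        List<String> values = new ArrayList<>();
        Matcher m = Pattern.compile(EMAIL).matcher(text);
        while (m.find()) {
            values.add(m.group());
        }
        return values;
    }

    // Group 1 is the field name, group 2 its value
    // (like fieldGroup="1" valueGroup="2").
    static Map<String, List<String>> tableFields(String html) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        Matcher m = Pattern.compile(ROW).matcher(html);
        while (m.find()) {
            fields.computeIfAbsent(m.group(1), k -> new ArrayList<>())
                    .add(m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(emails(
                "Contact john.doe@example.com or jane@test.org today"));
        // prints [john.doe@example.com, jane@test.org]
        System.out.println(tableFields(
                "<tr><td class=\"label\">Author</td><td class=\"value\">Jane Doe</td></tr>"
                + "<tr><td class=\"label\">Year</td><td class=\"value\">2021</td></tr>"));
        // prints {Author=[Jane Doe], Year=[2021]}
    }
}
```

Note that the simplified email expression accepts a trailing dot in the domain part, so an address at the end of a sentence would be captured together with the period.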
Constructor and Description |
---|
TextPatternTagger() |
Modifier and Type | Method and Description |
---|---|
void | addPattern(RegexFieldExtractor... pattern) Adds one or more patterns that will extract matching field names/values. |
void | addPattern(String field, String pattern) Adds a pattern that will extract the whole text matched into the given field. |
void | addPattern(String field, String pattern, int valueGroup) Adds a new pattern, which will extract the value from the specified group index upon matching. |
boolean | equals(Object other) |
List<RegexFieldExtractor> | getPatterns() Gets the patterns used to extract matching field names/values. |
int | hashCode() |
boolean | isCaseSensitive() Deprecated. Always false. Case sensitivity is now set on each pattern. |
protected void | loadStringTaggerFromXML(org.apache.commons.configuration.XMLConfiguration xml) Loads configuration settings specific to the implementing class. |
protected void | saveStringTaggerToXML(EnhancedXMLStreamWriter writer) Saves configuration settings specific to the implementing class. |
void | setCaseSensitive(boolean caseSensitive) Deprecated. Always false. Case sensitivity is now set on each pattern. |
void | setPattern(RegexFieldExtractor... pattern) Sets one or more patterns that will extract matching field names/values. |
protected void | tagStringContent(String reference, StringBuilder content, ImporterMetadata metadata, boolean parsed, int sectionIndex) |
String | toString() |
getMaxReadSize, loadCharStreamTaggerFromXML, saveCharStreamTaggerToXML, setMaxReadSize, tagTextDocument
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
tagDocument
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
loadFromXML, saveToXML
protected void tagStringContent(String reference, StringBuilder content, ImporterMetadata metadata, boolean parsed, int sectionIndex)
tagStringContent in class AbstractStringTagger

@Deprecated public boolean isCaseSensitive()
Returns:
true if case sensitive.

@Deprecated public void setCaseSensitive(boolean caseSensitive)
Parameters:
caseSensitive - true to consider character case

public void addPattern(String field, String pattern)
Parameters:
field - target field to store the matching pattern.
pattern - the pattern

public void addPattern(String field, String pattern, int valueGroup)
Parameters:
field - target field to store the matching pattern.
pattern - the pattern
valueGroup - which pattern group to return.

public void addPattern(RegexFieldExtractor... pattern)
Parameters:
pattern - field extractor pattern

public void setPattern(RegexFieldExtractor... pattern)
Parameters:
pattern - field extractor pattern

public List<RegexFieldExtractor> getPatterns()
protected void loadStringTaggerFromXML(org.apache.commons.configuration.XMLConfiguration xml) throws IOException
Loads configuration settings specific to the implementing class.
loadStringTaggerFromXML in class AbstractStringTagger
Parameters:
xml - XML configuration
Throws:
IOException - could not load from XML

protected void saveStringTaggerToXML(EnhancedXMLStreamWriter writer) throws XMLStreamException
Saves configuration settings specific to the implementing class.
saveStringTaggerToXML in class AbstractStringTagger
Parameters:
writer - the XML writer
Throws:
XMLStreamException - could not save to XML

public boolean equals(Object other)
equals in class AbstractStringTagger

public int hashCode()
hashCode in class AbstractStringTagger

public String toString()
toString in class AbstractStringTagger
Copyright © 2009–2021 Norconex Inc. All rights reserved.