Deprecated. Use RegexTagger.
@Deprecated public class TextPatternTagger extends AbstractStringTagger implements IXMLConfigurable
Extracts all text values matching the provided regular expression and adds them to a field that is either provided explicitly or also matched by a regular expression. The target field is considered a multi-value field.
It is possible to extract both the field names
and their values with regular expressions. This is done by using
match groups in your regular expressions (parentheses). For each pattern
you define, you can specify which match group holds the field name and
which one holds the value.
Specifying a field match group is optional if a field
is provided. If no match groups are specified, a field
is expected.
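The match-group mechanism described above can be illustrated with plain java.util.regex. This is only a minimal sketch under stated assumptions, not the tagger's actual implementation: the class MatchGroupSketch, its extract method, and the sample "name: value" input are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the Norconex API.
final class MatchGroupSketch {

    // Collects values into multi-value fields, taking the field name from
    // one match group and the value from another.
    static Map<String, List<String>> extract(
            String text, String regex, int fieldGroup, int valueGroup) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            String field = m.group(fieldGroup);
            String value = m.group(valueGroup);
            // Each field is multi-valued: new values go at the end of the list.
            fields.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = "color: red\nsize: large\ncolor: blue";
        // Group 1 holds the field name, group 2 holds the value.
        System.out.println(extract(text, "(\\w+): (\\w+)", 1, 2));
        // prints {color=[red, blue], size=[large]}
    }
}
```

Here group 1 plays the role of fieldGroup and group 2 the role of valueGroup. When a field name is supplied explicitly, only the value group (or the whole match) would be needed.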
If a target field with the same name already exists for a document,
values will be added to the end of the existing value list.
It is possible to change this default behavior by supplying a PropertySetter.
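To make the effect of the different set strategies more concrete, the following sketch models a document's metadata as a map of multi-value fields. It is not the PropertySetter implementation: the class OnSetSketch and its Strategy enum are hypothetical, the strategy names are borrowed from the onSet attribute values listed in the XML usage, and "optional" is assumed here to mean that a value is set only when the field has no value yet.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the Norconex API.
final class OnSetSketch {

    enum Strategy { APPEND, PREPEND, REPLACE, OPTIONAL }

    // Applies one value to a multi-value field according to the strategy.
    static void apply(Map<String, List<String>> doc, Strategy strategy,
            String field, String value) {
        List<String> values = doc.computeIfAbsent(field, k -> new ArrayList<>());
        switch (strategy) {
            case APPEND:
                values.add(value);      // default behavior: add to the end
                break;
            case PREPEND:
                values.add(0, value);   // add to the beginning
                break;
            case REPLACE:
                values.clear();         // discard existing values
                values.add(value);
                break;
            case OPTIONAL:
                if (values.isEmpty()) { // assumed: set only if nothing is there
                    values.add(value);
                }
                break;
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new HashMap<>();
        apply(doc, Strategy.APPEND, "title", "a");
        apply(doc, Strategy.APPEND, "title", "b");
        apply(doc, Strategy.PREPEND, "title", "c");
        System.out.println(doc.get("title")); // prints [c, a, b]
    }
}
```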
This class can be used as a pre-parsing handler (on text documents only) or as a post-parsing handler.
<handler
    class="com.norconex.importer.handler.tagger.impl.TextPatternTagger"
    sourceCharset="(character encoding)"
    maxReadSize="(max characters to read at once)">
  <restrictTo
      caseSensitive="[false|true]"
      field="(name of header/metadata field name to match)">
    (regular expression of value to match)
  </restrictTo>
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <pattern
      toField="(target field name)"
      fieldGroup="(field name match group index)"
      valueGroup="(field value match group index)"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      onSet="[append|prepend|replace|optional]">
    (regular expression)
  </pattern>
  <!-- multiple pattern tags allowed -->
</handler>
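The ignoreCase and ignoreDiacritic attributes in the usage above are not described further on this page. As a rough illustration of what case-insensitive and diacritic-insensitive matching can look like in plain Java, the sketch below uses Pattern flags and java.text.Normalizer. It is an assumption about the general technique, not a description of how this class implements those attributes; IgnoreFlagsSketch and its methods are hypothetical names.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the Norconex API.
final class IgnoreFlagsSketch {

    // Decomposes accented characters (NFD) and drops the combining marks,
    // so that "\u00e9" (e with acute accent) becomes a plain "e".
    static String stripDiacritics(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
    }

    // Returns true if the regex is found in the text under the given options.
    static boolean matches(String text, String regex,
            boolean ignoreCase, boolean ignoreDiacritic) {
        int flags = ignoreCase
                ? Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE
                : 0;
        // Strip marks from both sides so accented and plain forms compare equal.
        String subject = ignoreDiacritic ? stripDiacritics(text) : text;
        String expr = ignoreDiacritic ? stripDiacritics(regex) : regex;
        return Pattern.compile(expr, flags).matcher(subject).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("Caf\u00e9 MENU", "cafe menu", true, true));  // prints true
        System.out.println(matches("Caf\u00e9 MENU", "cafe menu", true, false)); // prints false
    }
}
```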
The first pattern in the following example extracts what look like email addresses into an "emails" field (simplified regex). The second pattern extracts field names and values from "label" and "value" cells of a given HTML table:
<handler
    class="TextPatternTagger">
  <pattern
      field="emails">
    [A-Za-z0-9+_.-]+?@[a-zA-Z0-9.-]+
  </pattern>
  <pattern
      fieldGroup="1"
      valueGroup="2">
    <![CDATA[
      <tr><td class="label">(.*?)</td><td class="value">(.*?)</td></tr>
    ]]>
  </pattern>
</handler>
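The two regular expressions from the example above can be tried out directly with java.util.regex before putting them in a configuration. The following is a minimal sketch, not the tagger itself: the class ExamplePatternsSketch, its methods, and the sample text and HTML are hypothetical and only serve to show what each expression matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the Norconex API.
final class ExamplePatternsSketch {

    // Same expressions as in the XML example above.
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9+_.-]+?@[a-zA-Z0-9.-]+");
    static final Pattern ROW = Pattern.compile(
            "<tr><td class=\"label\">(.*?)</td><td class=\"value\">(.*?)</td></tr>");

    // Whole-match extraction: every match becomes one value.
    static List<String> emails(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Group-based extraction: group 1 is the field name, group 2 the value.
    static List<String> rows(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(emails(
                "Contact jane@example.com or bob.smith@example.org today"));
        // prints [jane@example.com, bob.smith@example.org]
        System.out.println(rows(
                "<table><tr><td class=\"label\">Author</td>"
                + "<td class=\"value\">Jane</td></tr>"
                + "<tr><td class=\"label\">Year</td>"
                + "<td class=\"value\">2023</td></tr></table>"));
        // prints [Author=Jane, Year=2023]
    }
}
```

Note that the simplified email expression also accepts a trailing period when an address ends a sentence, which is one reason the original text calls it simplified.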
Constructor and Description |
---|
TextPatternTagger() Deprecated. |
Modifier and Type | Method and Description |
---|---|
void | addPattern(RegexFieldValueExtractor... pattern) Deprecated. Adds one or more patterns that will extract matching field names/values. |
void | addPattern(String field, String pattern) Deprecated. Adds a pattern that will extract the whole text matched into the given field. |
void | addPattern(String field, String pattern, int valueGroup) Deprecated. Adds a new pattern, which will extract the value from the specified group index upon matching. |
boolean | equals(Object other) Deprecated. |
List<RegexFieldValueExtractor> | getPatterns() Deprecated. Gets the patterns used to extract matching field names/values. |
int | hashCode() Deprecated. |
protected void | loadStringTaggerFromXML(XML xml) Deprecated. Loads configuration settings specific to the implementing class. |
protected void | saveStringTaggerToXML(XML xml) Deprecated. Saves configuration settings specific to the implementing class. |
void | setPattern(RegexFieldValueExtractor... patterns) Deprecated. Sets one or more patterns that will extract matching field names/values. |
protected void | tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) Deprecated. |
String | toString() Deprecated. |
getMaxReadSize, loadCharStreamTaggerFromXML, saveCharStreamTaggerToXML, setMaxReadSize, tagTextDocument
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
tagDocument
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
loadFromXML, saveToXML
protected void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) throws ImporterHandlerException
Specified by: tagStringContent in class AbstractStringTagger
Throws: ImporterHandlerException
public void addPattern(String field, String pattern)
Parameters:
field - target field to store the matching pattern.
pattern - the pattern
public void addPattern(String field, String pattern, int valueGroup)
Parameters:
field - target field to store the matching pattern.
pattern - the pattern
valueGroup - which pattern group to return.
public void addPattern(RegexFieldValueExtractor... pattern)
Parameters:
pattern - field extractor pattern
public void setPattern(RegexFieldValueExtractor... patterns)
Parameters:
patterns - field extractor pattern
public List<RegexFieldValueExtractor> getPatterns()
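protected void loadStringTaggerFromXML(XML xml)
Description copied from class: AbstractStringTagger
Loads configuration settings specific to the implementing class.
Specified by: loadStringTaggerFromXML in class AbstractStringTagger
Parameters:
xml - xml configuration
protected void saveStringTaggerToXML(XML xml)
Description copied from class: AbstractStringTagger
Saves configuration settings specific to the implementing class.
Specified by: saveStringTaggerToXML in class AbstractStringTagger
Parameters:
xml - the XML
public boolean equals(Object other)
Overrides: equals in class AbstractStringTagger
public int hashCode()
Overrides: hashCode in class AbstractStringTagger
public String toString()
Overrides: toString in class AbstractStringTagger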
Copyright © 2009–2023 Norconex Inc. All rights reserved.