public class RegexTagger extends AbstractStringTagger implements IXMLConfigurable
Extracts field names and their values using regular expressions.
This is done by using match groups (parentheses) in your regular expressions. For each pattern
you define, you can specify which match group holds the field name and
which one holds the value.
Specifying a field match group is optional if a target field ("toField") is provided.
If no match groups are specified, a target field is expected and the whole matched text is used as the value.
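The role of the two match groups can be illustrated with plain java.util.regex. The following is a minimal sketch, not the RegexTagger implementation: the class name, the `extract` method and the `UNSET` constant are hypothetical, and a negative group index is assumed here to mean "not specified".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; names and conventions are assumptions.
class MatchGroupSketch {

    // Hypothetical convention: a negative index means "group not specified".
    static final int UNSET = -1;

    static Map<String, List<String>> extract(
            String text, String regex, String toField,
            int fieldGroup, int valueGroup) {
        if (fieldGroup < 0 && toField == null) {
            throw new IllegalArgumentException(
                    "Either a field match group or a target field is required.");
        }
        Map<String, List<String>> fields = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            // Field name: taken from the match group if given, else the target field.
            String name = fieldGroup >= 0 ? m.group(fieldGroup) : toField;
            // Value: taken from the match group if given, else the whole match.
            String value = valueGroup >= 0 ? m.group(valueGroup) : m.group();
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = "title: Hello\nauthor: Bob\nauthor: Alice";
        // Both groups specified: field names come from group 1, values from group 2.
        System.out.println(extract(text, "(\\w+): (.*)", null, 1, 2));
        // Only a value group: every value goes to the fixed "writers" field.
        System.out.println(extract(text, "author: (.*)", "writers", UNSET, 1));
    }
}
```

In this sketch, repeated matches for the same field name accumulate as multiple values, which mirrors the multi-value behavior described for target fields.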
If "fieldMatcher" is specified, the content of the matching fields is used as the source, and all extracted text is stored in the target field as multiple values. Otherwise, the document content is used.
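The choice between field values and document content as the extraction source could be pictured as follows. This is a sketch under stated assumptions: a `Predicate<String>` on field names stands in for the real TextMatcher, a plain map stands in for document metadata, and `sourceTexts` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative sketch only; the Predicate stands in for a field matcher.
class SourceSelectionSketch {

    static List<String> sourceTexts(
            Map<String, List<String>> fields,
            String content,
            Predicate<String> fieldMatcher) {
        // No field matcher: the document content is the only source.
        if (fieldMatcher == null) {
            return Collections.singletonList(content);
        }
        // Otherwise: every value of every matching field is a source.
        List<String> texts = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : fields.entrySet()) {
            if (fieldMatcher.test(e.getKey())) {
                texts.addAll(e.getValue());
            }
        }
        return texts;
    }
}
```

Each returned text would then be run through the configured patterns.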
If a target field with the same name already exists for a document,
values will be added to the end of the existing value list.
It is possible to change this default behavior by supplying a PropertySetter.
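The XML usage lists four "onSet" options (append, prepend, replace, optional). The sketch below shows one plausible reading of them on a plain map of value lists. The `OnSet` enum and `set` method are hypothetical stand-ins, not the PropertySetter API, and the meaning given to "optional" (set only when the field has no value yet) is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the PropertySetter API.
class OnSetSketch {

    // Hypothetical enum mirroring the "onSet" options of the XML usage.
    enum OnSet { APPEND, PREPEND, REPLACE, OPTIONAL }

    static void set(Map<String, List<String>> doc, OnSet onSet,
            String field, List<String> newValues) {
        List<String> existing = doc.computeIfAbsent(field, k -> new ArrayList<>());
        switch (onSet) {
            case APPEND:
                // Default behavior: add to the end of the existing list.
                existing.addAll(newValues);
                break;
            case PREPEND:
                existing.addAll(0, newValues);
                break;
            case REPLACE:
                existing.clear();
                existing.addAll(newValues);
                break;
            case OPTIONAL:
                // Assumed semantics: keep existing values if there are any.
                if (existing.isEmpty()) {
                    existing.addAll(newValues);
                }
                break;
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        set(doc, OnSet.APPEND, "emails", List.of("a@example.com"));
        set(doc, OnSet.APPEND, "emails", List.of("b@example.com"));
        System.out.println(doc);
    }
}
```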
This class can be used as a pre-parsing handler on text documents only or a post-parsing handler.
<handler
class="com.norconex.importer.handler.tagger.impl.RegexTagger"
maxReadSize="(max characters to read at once)"
sourceCharset="(character encoding)">
<!-- multiple "restrictTo" tags allowed (only one needs to match) -->
<restrictTo>
<fieldMatcher
method="[basic|csv|wildcard|regex]"
ignoreCase="[false|true]"
ignoreDiacritic="[false|true]"
partial="[false|true]">
(field-matching expression)
</fieldMatcher>
<valueMatcher
method="[basic|csv|wildcard|regex]"
ignoreCase="[false|true]"
ignoreDiacritic="[false|true]"
partial="[false|true]">
(value-matching expression)
</valueMatcher>
</restrictTo>
<fieldMatcher
method="[basic|csv|wildcard|regex]"
ignoreCase="[false|true]"
ignoreDiacritic="[false|true]"
partial="[false|true]">
(optional expression matching source fields on which to perform extraction)
</fieldMatcher>
<!-- multiple pattern tags allowed -->
<pattern
toField="(toField name)"
fieldGroup="(toField name match group index)"
valueGroup="(value match group index)"
onSet="[append|prepend|replace|optional]"
ignoreCase="[false|true]"
ignoreDiacritic="[false|true]"
dotAll="[false|true]"
unixLines="[false|true]"
literal="[false|true]"
comments="[false|true]"
multiline="[false|true]"
canonEq="[false|true]"
unicodeCase="[false|true]"
unicodeCharacterClass="[false|true]">
(regular expression)
</pattern>
</handler>
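Several of the pattern attributes above share their names with java.util.regex.Pattern flags. Assuming they map to those flags (an assumption, not something stated on this page), the sketch below shows how three of them change matching. The `flags` and `found` helpers are hypothetical, and "ignoreDiacritic" is not covered because it has no Pattern flag equivalent.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; the attribute-to-flag mapping is an assumption.
class PatternFlagSketch {

    static int flags(boolean ignoreCase, boolean dotAll, boolean multiline) {
        int f = 0;
        if (ignoreCase) {
            f |= Pattern.CASE_INSENSITIVE;
        }
        if (dotAll) {
            f |= Pattern.DOTALL;      // "." also matches line terminators
        }
        if (multiline) {
            f |= Pattern.MULTILINE;   // "^" and "$" match at line boundaries
        }
        return f;
    }

    static boolean found(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String html = "<td>first\nsecond</td>";
        // Without dotAll, "." stops at the line break, so the cell is not matched.
        System.out.println(found("<td>(.*?)</td>", flags(false, false, false), html));
        System.out.println(found("<td>(.*?)</td>", flags(false, true, false), html));
        // ignoreCase lets an upper-case pattern match lower-case markup.
        System.out.println(found("<TD>", flags(true, false, false), html));
        // multiline lets "^" match at the start of the second line.
        System.out.println(found("^second", flags(false, false, true), html));
    }
}
```

For values that span several lines, such as HTML cells with embedded line breaks, the dotAll behavior shown here is usually the one that matters.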
<handler
class="RegexTagger">
<pattern
toField="emails">
[A-Za-z0-9+_.-]+?@[a-zA-Z0-9.-]+
</pattern>
<pattern
fieldGroup="1"
valueGroup="2">
<![CDATA[
<tr><td class="label">(.*?)</td><td class="value">(.*?)</td></tr>
]]>
</pattern>
</handler>
The first pattern in the above example extracts what look like email addresses into an "emails" field (simplified regex). The second pattern extracts field names and values from the "label" and "value" cells of an HTML table.
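To see what the two example expressions match, they can be run directly with java.util.regex outside the importer. This is only a sketch: the sample text, the class name and the helper methods are made up for illustration, and the handler itself is not involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; runs the example regexes on made-up input.
class ExamplePatternsSketch {

    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9+_.-]+?@[a-zA-Z0-9.-]+");

    static final Pattern ROW = Pattern.compile(
            "<tr><td class=\"label\">(.*?)</td><td class=\"value\">(.*?)</td></tr>");

    // Whole matches, as they would go to the "emails" field.
    static List<String> emails(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Group 1 is the field name, group 2 the value.
    static Map<String, String> rows(String html) {
        Map<String, String> found = new LinkedHashMap<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            found.put(m.group(1), m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(emails("Contact joe@example.com or ann.lee@example.org today"));
        System.out.println(rows(
                "<tr><td class=\"label\">Color</td><td class=\"value\">Blue</td></tr>"
              + "<tr><td class=\"label\">Size</td><td class=\"value\">Large</td></tr>"));
    }
}
```

Because the email expression is simplified, it also accepts trailing dots or hyphens after the domain, so real input may need a stricter pattern.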
See also: RegexFieldValueExtractor
Constructor and Description |
---|
RegexTagger() |
Modifier and Type | Method and Description |
---|---|
void | addPattern(RegexFieldValueExtractor... pattern) Adds one or more patterns that will extract matching field names/values. |
void | addPattern(String field, String pattern) Adds a pattern that will extract the whole text matched into the given field. |
void | addPattern(String field, String pattern, int valueGroup) Adds a new pattern, which will extract the value from the specified group index upon matching. |
boolean | equals(Object other) |
TextMatcher | getFieldMatcher() Gets source field matcher for fields on which to extract fields/values. |
List<RegexFieldValueExtractor> | getPatterns() Gets the patterns used to extract matching field names/values. |
int | hashCode() |
protected void | loadStringTaggerFromXML(XML xml) Loads configuration settings specific to the implementing class. |
protected void | saveStringTaggerToXML(XML xml) Saves configuration settings specific to the implementing class. |
void | setFieldMatcher(TextMatcher fieldMatcher) Sets source field matcher for fields on which to extract fields/values. |
void | setPattern(RegexFieldValueExtractor... patterns) Sets one or more patterns that will extract matching field names/values. |
protected void | tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) |
String | toString() |
Inherited methods:
getMaxReadSize, loadCharStreamTaggerFromXML, saveCharStreamTaggerToXML, setMaxReadSize, tagTextDocument
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
tagDocument
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
loadFromXML, saveToXML
protected void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) throws ImporterHandlerException
Specified by: tagStringContent in class AbstractStringTagger
Throws: ImporterHandlerException
public void addPattern(String field, String pattern)
Parameters:
field - target field to store the matching pattern.
pattern - the pattern

public void addPattern(String field, String pattern, int valueGroup)
Parameters:
field - target field to store the matching pattern.
pattern - the pattern
valueGroup - which pattern group to return.

public void addPattern(RegexFieldValueExtractor... pattern)
Parameters:
pattern - field extractor pattern

public void setPattern(RegexFieldValueExtractor... patterns)
Parameters:
patterns - field extractor pattern

public List<RegexFieldValueExtractor> getPatterns()

public TextMatcher getFieldMatcher()

public void setFieldMatcher(TextMatcher fieldMatcher)
Parameters:
fieldMatcher - field matcher

protected void loadStringTaggerFromXML(XML xml)
Specified by: loadStringTaggerFromXML in class AbstractStringTagger
Parameters:
xml - XML configuration

protected void saveStringTaggerToXML(XML xml)
Specified by: saveStringTaggerToXML in class AbstractStringTagger
Parameters:
xml - the XML

public boolean equals(Object other)
Overrides: equals in class AbstractStringTagger

public int hashCode()
Overrides: hashCode in class AbstractStringTagger

public String toString()
Overrides: toString in class AbstractStringTagger
Copyright © 2009–2023 Norconex Inc. All rights reserved.