Class RegexTagger
-
- All Implemented Interfaces:
IXMLConfigurable, IImporterHandler, IDocumentTagger
public class RegexTagger extends AbstractStringTagger implements IXMLConfigurable
Extracts field names and their values using regular expressions. This is done with match groups (parentheses) in your regular expressions. For each pattern you define, you can specify which match group holds the field name and which one holds the value. Specifying a field match group is optional if a field is provided. If no match groups are specified, a field is expected.
If "fieldMatcher" is specified, content is taken from the matching fields and all extracted text is stored in the target field as multiple values. Otherwise, the document content is used.
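The group rules above can be sketched with plain java.util.regex. This is an illustration of the idea only, not the Norconex implementation: the class MatchGroupSketch, its extract method, and the convention that a negative group index means "not specified" are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Norconex API.
class MatchGroupSketch {

    // A negative fieldGroup or valueGroup means "not specified" (sketch convention).
    static Map<String, List<String>> extract(
            String text, String regex, String toField,
            int fieldGroup, int valueGroup) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            // Field name: taken from its match group if given, else the fixed field.
            String name = fieldGroup >= 0 ? m.group(fieldGroup) : toField;
            // Value: taken from its match group if given, else the whole match.
            String value = valueGroup >= 0 ? m.group(valueGroup) : m.group();
            if (name == null || value == null) {
                continue; // neither a field nor a usable group: nothing to store
            }
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = "title=Report\nauthor=Jane\nauthor=Sam";
        // Field name from group 1, value from group 2.
        System.out.println(extract(text, "(\\w+)=(.*)", null, 1, 2));
        // prints {title=[Report], author=[Jane, Sam]}
        // Fixed field name, value from group 1.
        System.out.println(extract(text, "author=(.*)", "writer", -1, 1));
        // prints {writer=[Jane, Sam]}
    }
}
```

Repeated matches for the same field name accumulate as multiple values, which mirrors the multi-value storage described above.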
Storing values in an existing field
If a target field with the same name already exists for a document, values will be added to the end of the existing value list. It is possible to change this default behavior by supplying a PropertySetter.
This class can be used as a pre-parsing handler (on text documents only) or as a post-parsing handler.
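The difference between the default append behavior and an alternative strategy can be sketched as follows. The Mode enum here is a hypothetical stand-in used only for illustration; it is not the Norconex PropertySetter type, and the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of storing values in an existing field.
class ExistingFieldSketch {

    // Stand-in for a setter strategy; not the real PropertySetter.
    enum Mode { APPEND, REPLACE }

    static void store(Map<String, List<String>> fields,
            String name, String value, Mode mode) {
        List<String> values =
                fields.computeIfAbsent(name, k -> new ArrayList<>());
        if (mode == Mode.REPLACE) {
            values.clear(); // discard existing values before storing
        }
        values.add(value);  // APPEND: new value goes to the end of the list
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        store(fields, "emails", "a@example.com", Mode.APPEND);
        store(fields, "emails", "b@example.com", Mode.APPEND);
        System.out.println(fields); // prints {emails=[a@example.com, b@example.com]}
        store(fields, "emails", "c@example.com", Mode.REPLACE);
        System.out.println(fields); // prints {emails=[c@example.com]}
    }
}
```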
XML configuration usage:
<handler class="com.norconex.importer.handler.tagger.impl.RegexTagger"
    maxReadSize="(max characters to read at once)"
    sourceCharset="(character encoding)">

  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher>(field-matching expression)</fieldMatcher>
    <valueMatcher>(value-matching expression)</valueMatcher>
  </restrictTo>

  <fieldMatcher>
    (optional expression matching source fields on which to perform extraction)
  </fieldMatcher>

  <!-- multiple pattern tags allowed -->
  <pattern>(regular expression)</pattern>
</handler>
XML usage example:
<handler class="RegexTagger">
  <pattern toField="emails">
    [A-Za-z0-9+_.-]+?@[a-zA-Z0-9.-]+
  </pattern>
  <pattern fieldGroup="1" valueGroup="2">
    <![CDATA[
      <tr><td class="label">(.*?)</td><td class="value">(.*?)</td></tr>
    ]]>
  </pattern>
</handler>
The first pattern in the above example extracts what look like email addresses into an "emails" field (simplified regex). The second pattern extracts field names and values from the "label" and "value" cells of an HTML table.
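To see what the two regular expressions from the example match, they can be tried with plain java.util.regex. This sketch does not use RegexTagger itself; the class ExamplePatternsSketch, its methods, and the sample input are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of the two example patterns, outside of Norconex.
class ExamplePatternsSketch {

    // Same simplified email regex as in the XML example.
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9+_.-]+?@[a-zA-Z0-9.-]+");

    // Same table-row regex: group 1 is the label, group 2 is the value.
    static final Pattern ROW = Pattern.compile(
            "<tr><td class=\"label\">(.*?)</td>"
            + "<td class=\"value\">(.*?)</td></tr>");

    // Whole match is the value (no match groups needed).
    static List<String> emails(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Returns "field=value" strings built from groups 1 and 2.
    static List<String> rows(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(
                emails("Contact jane.doe@example.com or sam@example.org today"));
        // prints [jane.doe@example.com, sam@example.org]
        System.out.println(rows(
                "<table><tr><td class=\"label\">Author</td>"
                + "<td class=\"value\">Jane</td></tr>"
                + "<tr><td class=\"label\">Year</td>"
                + "<td class=\"value\">2020</td></tr></table>"));
        // prints [Author=Jane, Year=2020]
    }
}
```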
- Since:
- 3.0.0, based on former TextPatternTagger
- Author:
- Pascal Essiembre
- See Also:
RegexFieldValueExtractor
-
-
Constructor Summary
Constructors:
RegexTagger()
-
Method Summary
Modifier and Type, Method, Description:
void addPattern(RegexFieldValueExtractor... pattern): Adds one or more patterns that will extract matching field names/values.
void addPattern(String field, String pattern): Adds a pattern that will extract the whole text matched into the given field.
void addPattern(String field, String pattern, int valueGroup): Adds a new pattern, which will extract the value from the specified group index upon matching.
boolean equals(Object other)
TextMatcher getFieldMatcher(): Gets source field matcher for fields on which to extract fields/values.
List<RegexFieldValueExtractor> getPatterns(): Gets the patterns used to extract matching field names/values.
int hashCode()
protected void loadStringTaggerFromXML(XML xml): Loads configuration settings specific to the implementing class.
protected void saveStringTaggerToXML(XML xml): Saves configuration settings specific to the implementing class.
void setFieldMatcher(TextMatcher fieldMatcher): Sets source field matcher for fields on which to extract fields/values.
void setPattern(RegexFieldValueExtractor... patterns): Sets one or more patterns that will extract matching field names/values.
protected void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex)
String toString()
Methods inherited from class com.norconex.importer.handler.tagger.AbstractStringTagger
getMaxReadSize, loadCharStreamTaggerFromXML, saveCharStreamTaggerToXML, setMaxReadSize, tagTextDocument
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractCharStreamTagger
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractDocumentTagger
tagDocument
-
Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
-
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface com.norconex.commons.lang.xml.IXMLConfigurable
loadFromXML, saveToXML
Method Detail
-
tagStringContent
protected void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) throws ImporterHandlerException
- Specified by:
tagStringContent in class AbstractStringTagger
- Throws:
ImporterHandlerException
-
addPattern
public void addPattern(String field, String pattern)
Adds a pattern that will extract the whole text matched into the given field.
- Parameters:
field - target field to store the matching pattern.
pattern - the pattern
-
addPattern
public void addPattern(String field, String pattern, int valueGroup)
Adds a new pattern, which will extract the value from the specified group index upon matching.
- Parameters:
field - target field to store the matching pattern.
pattern - the pattern
valueGroup - which pattern group to return.
-
addPattern
public void addPattern(RegexFieldValueExtractor... pattern)
Adds one or more patterns that will extract matching field names/values.
- Parameters:
pattern - field extractor pattern
-
setPattern
public void setPattern(RegexFieldValueExtractor... patterns)
Sets one or more patterns that will extract matching field names/values. Clears previously set patterns.
- Parameters:
patterns - field extractor pattern
-
getPatterns
public List<RegexFieldValueExtractor> getPatterns()
Gets the patterns used to extract matching field names/values.
- Returns:
patterns
-
getFieldMatcher
public TextMatcher getFieldMatcher()
Gets source field matcher for fields on which to extract fields/values.
- Returns:
field matcher
-
setFieldMatcher
public void setFieldMatcher(TextMatcher fieldMatcher)
Sets source field matcher for fields on which to extract fields/values.
- Parameters:
fieldMatcher - field matcher
-
loadStringTaggerFromXML
protected void loadStringTaggerFromXML(XML xml)
Description copied from class: AbstractStringTagger
Loads configuration settings specific to the implementing class.
- Specified by:
loadStringTaggerFromXML in class AbstractStringTagger
- Parameters:
xml - xml configuration
-
saveStringTaggerToXML
protected void saveStringTaggerToXML(XML xml)
Description copied from class: AbstractStringTagger
Saves configuration settings specific to the implementing class. The parent tag along with the "class" attribute are already written. Implementors must not close the writer.
- Specified by:
saveStringTaggerToXML in class AbstractStringTagger
- Parameters:
xml - the XML
-
equals
public boolean equals(Object other)
- Overrides:
equals in class AbstractStringTagger
-
hashCode
public int hashCode()
- Overrides:
hashCode in class AbstractStringTagger
-
toString
public String toString()
- Overrides:
toString in class AbstractStringTagger
-
-