public class SplitTagger extends AbstractCharStreamTagger
Splits an existing metadata value into multiple values based on a given value separator (the separator gets discarded). The "toField" argument is optional (the same field will be used to store the splits if no "toField" is specified). Duplicates are removed.
Can be used as either a pre-parse (metadata or text content) or a post-parse handler.
If no "fieldMatcher" expression is specified, the document content will be used. If the "fieldMatcher" matches more than one field, they will all be split and stored in the same multi-value metadata field.
If a target field with the same name already exists for a document, values will be added to the end of the existing value list. It is possible to change this default behavior by supplying a PropertySetter.
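The XML usage on this page lists four "onSet" values: append, prepend, replace and optional. As a rough illustration of how such strategies could combine existing target-field values with newly split ones, here is a minimal, self-contained sketch. The class `OnSetSketch` and its `apply` method are hypothetical and do not use the real PropertySetter class; the meaning given to "optional" (only set values when the field has none yet) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the Norconex Importer API.
public class OnSetSketch {

    // Combines the values already stored in the target field with the
    // newly split values, according to the chosen "onSet" strategy.
    static List<String> apply(
            String onSet, List<String> existing, List<String> added) {
        List<String> result = new ArrayList<>();
        switch (onSet) {
            case "append":
                // Default behavior described above: new values go last.
                result.addAll(existing);
                result.addAll(added);
                break;
            case "prepend":
                result.addAll(added);
                result.addAll(existing);
                break;
            case "replace":
                result.addAll(added);
                break;
            case "optional":
                // Assumed meaning: keep existing values if there are any.
                result.addAll(existing.isEmpty() ? added : existing);
                break;
            default:
                throw new IllegalArgumentException("Unknown onSet: " + onSet);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> existing = List.of("a");
        List<String> added = List.of("b", "c");
        System.out.println(apply("append", existing, added));   // [a, b, c]
        System.out.println(apply("prepend", existing, added));  // [b, c, a]
        System.out.println(apply("replace", existing, added));  // [b, c]
        System.out.println(apply("optional", existing, added)); // [a]
    }
}
```

In an actual configuration the strategy is selected with the "onSet" attribute of the split tag rather than in code.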
<handler
    class="com.norconex.importer.handler.tagger.impl.SplitTagger"
    sourceCharset="(character encoding)">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  <!-- multiple split tags allowed -->
  <split
      toField="targetFieldName"
      onSet="[append|prepend|replace|optional]">
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (one or more matching fields to split)
    </fieldMatcher>
    <separator
        regex="[false|true]">
      (separator value)
    </separator>
  </split>
</handler>
<handler
    class="SplitTagger">
  <split>
    <fieldMatcher>myField</fieldMatcher>
    <separator
        regex="true">
      \s*,\s*
    </separator>
  </split>
</handler>
The above example splits a single value field holding a comma-separated list into multiple values.
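To make the effect of that example concrete, the following plain-Java sketch mimics the documented behavior: each value is split on the separator, the separator is discarded, and duplicates are removed. The class `SplitSketch` and its `split` method are hypothetical and are not the tagger's actual implementation; keeping first-seen order when removing duplicates is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Norconex Importer API.
public class SplitSketch {

    // Splits every value on the separator, drops the separator itself,
    // and removes duplicates while keeping first-seen order (assumed).
    static List<String> split(
            List<String> values, String separator, boolean regex) {
        // When the separator is not a regex, quote it so it matches literally.
        Pattern pattern =
                Pattern.compile(regex ? separator : Pattern.quote(separator));
        Set<String> unique = new LinkedHashSet<>();
        for (String value : values) {
            for (String part : pattern.split(value)) {
                unique.add(part);
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Same separator as the example: a comma with optional whitespace.
        List<String> myField = List.of("apple , banana,cherry,apple");
        System.out.println(split(myField, "\\s*,\\s*", true));
        // prints [apple, banana, cherry]
    }
}
```

Using a regular expression such as `\s*,\s*` rather than a literal comma also trims the whitespace around each entry, which is why the example sets regex="true".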
Modifier and Type | Class and Description |
---|---|
static class | SplitTagger.SplitDetails |
Constructor and Description |
---|
SplitTagger() |
Modifier and Type | Method and Description |
---|---|
void | addSplit(String fromField, String separator, boolean regex) Deprecated. |
void | addSplit(String fromField, String toField, String separator, boolean regex) Deprecated. |
void | addSplitDetails(SplitTagger.SplitDetails sd) |
boolean | equals(Object other) |
List<SplitTagger.SplitDetails> | getSplitDetailsList() |
List<SplitTagger.SplitDetails> | getSplits() Deprecated. |
int | hashCode() |
protected void | loadCharStreamTaggerFromXML(XML xml) Loads configuration settings specific to the implementing class. |
void | removeSplit(String fromField) Deprecated. |
void | removeSplitDetails(String fromField) |
protected void | saveCharStreamTaggerToXML(XML xml) Saves configuration settings specific to the implementing class. |
protected void | tagTextDocument(HandlerDoc doc, Reader input, ParseState parseState) |
String | toString() |
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
tagDocument
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
protected void tagTextDocument(HandlerDoc doc, Reader input, ParseState parseState) throws ImporterHandlerException
tagTextDocument
in class AbstractCharStreamTagger
ImporterHandlerException
public List<SplitTagger.SplitDetails> getSplitDetailsList()
@Deprecated public List<SplitTagger.SplitDetails> getSplits()
public void removeSplitDetails(String fromField)
@Deprecated public void removeSplit(String fromField)
public void addSplitDetails(SplitTagger.SplitDetails sd)
@Deprecated public void addSplit(String fromField, String separator, boolean regex)
@Deprecated public void addSplit(String fromField, String toField, String separator, boolean regex)
protected void loadCharStreamTaggerFromXML(XML xml)
AbstractCharStreamTagger
loadCharStreamTaggerFromXML
in class AbstractCharStreamTagger
xml - xml configuration
protected void saveCharStreamTaggerToXML(XML xml)
AbstractCharStreamTagger
saveCharStreamTaggerToXML
in class AbstractCharStreamTagger
xml - the XML
public boolean equals(Object other)
equals
in class AbstractCharStreamTagger
public int hashCode()
hashCode
in class AbstractCharStreamTagger
public String toString()
toString
in class AbstractCharStreamTagger
Copyright © 2009–2023 Norconex Inc. All rights reserved.