public class CharsetTagger extends AbstractDocumentTagger implements IXMLConfigurable
Converts one or more field values (if needed) from a source character encoding (charset) to a target one. Both the source and target character encodings are optional. If no source character encoding is explicitly provided, it first tries to detect the encoding of the field values before converting them to the target encoding. If the source character encoding cannot be established, the content encoding will remain unchanged. When no target character encoding is specified, UTF-8 is assumed.
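As a rough illustration of the conversion described above, the following standalone sketch re-encodes raw bytes from a known source charset to a target charset, falling back to UTF-8 when no target is given. It relies only on the JDK. The class name CharsetConversionSketch and its convert method are hypothetical and are not part of CharsetTagger, whose actual implementation may differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetConversionSketch {

    /**
     * Re-encodes text bytes from a source charset to a target charset.
     * A null target falls back to UTF-8, mirroring the documented default.
     * (Hypothetical helper for illustration only.)
     */
    public static byte[] convert(byte[] input, Charset source, Charset target) {
        Charset effectiveTarget = (target != null) ? target : StandardCharsets.UTF_8;
        // Decode with the source charset, then encode with the target one.
        String decoded = new String(input, source);
        return decoded.getBytes(effectiveTarget);
    }

    public static void main(String[] args) {
        // "é" is the single byte 0xE9 in ISO-8859-1.
        byte[] latin1 = {(byte) 0xE9};
        byte[] utf8 = convert(latin1, StandardCharsets.ISO_8859_1, null);
        // In UTF-8 the same character is encoded as the two bytes 0xC3 0xA9.
        for (byte b : utf8) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

The byte length changes during conversion even though the text itself is the same, which is why the source encoding has to be known or detected correctly beforehand.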
Before using this tagger, note that document parsing by the importer (using the default document parser factory) will try to convert and return fields as UTF-8 (for most, if not all, content-types). If UTF-8 is your desired target, it only makes sense to use this tagger as a pre-parsing handler (for text content-types only) when it is important to work with a specific character encoding before parsing. If, on the other hand, you wish to convert to a target character encoding other than UTF-8, you can use this tagger as a post-parsing handler to do so.
Because character encoding detection is not always accurate and because documents sometimes mix different encodings, there is no guarantee this class will handle ALL character encoding conversions properly.
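One reason detection cannot be fully trusted is that some encodings accept any byte sequence. The sketch below, using only the JDK, checks whether bytes decode without errors in a candidate charset. StrictDecodeSketch and decodesCleanly are hypothetical names, and this is not how CharsetTagger detects encodings.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeSketch {

    /**
     * Returns true only if the whole input is valid in the given charset.
     * (Hypothetical helper for illustration only.)
     */
    public static boolean decodesCleanly(byte[] input, Charset charset) {
        try {
            charset.newDecoder()
                    // Fail instead of silently substituting replacement characters.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // "é" in ISO-8859-1
        System.out.println(decodesCleanly(latin1, StandardCharsets.UTF_8));
        System.out.println(decodesCleanly(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

Here the lone byte 0xE9 is rejected as UTF-8 but accepted as ISO-8859-1. Since ISO-8859-1 maps every byte value to a character, a clean decode alone does not prove that a guessed encoding is the right one.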
<handler
    class="com.norconex.importer.handler.tagger.impl.CharsetTagger"
    sourceCharset="(character encoding)"
    targetCharset="(character encoding)">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  <fieldMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (expression matching fields to be converted)
  </fieldMatcher>
</handler>
<handler
    class="CharsetTagger"
    sourceCharset="ISO-8859-1"
    targetCharset="UTF-8">
  <fieldMatcher>description</fieldMatcher>
</handler>
The above example converts the characters of a "description" field from "ISO-8859-1" to "UTF-8".
See also: CharsetTransformer
Modifier and Type | Field and Description |
---|---|
static String | DEFAULT_TARGET_CHARSET |
Constructor and Description |
---|
CharsetTagger() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
TextMatcher | getFieldMatcher() Gets field matcher. |
String | getFieldsRegex() Deprecated. Since 3.0.0, use getFieldMatcher(). |
String | getSourceCharset() |
String | getTargetCharset() |
int | hashCode() |
protected void | loadHandlerFromXML(XML xml) Loads configuration settings specific to the implementing class. |
protected void | saveHandlerToXML(XML xml) Saves configuration settings specific to the implementing class. |
void | setFieldMatcher(TextMatcher fieldMatcher) Set field matcher (copy). |
void | setFieldsRegex(String fieldsRegex) Deprecated. Since 3.0.0, use setFieldMatcher(TextMatcher). |
void | setSourceCharset(String sourceCharset) |
void | setTargetCharset(String targetCharset) |
void | tagApplicableDocument(HandlerDoc doc, InputStream document, ParseState parseState) |
String | toString() |
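Methods inherited from class AbstractDocumentTagger: tagDocument
Methods inherited from class AbstractImporterHandler: addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
Methods inherited from class Object: clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Methods inherited from interface IXMLConfigurable: loadFromXML, saveToXML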
public static final String DEFAULT_TARGET_CHARSET
public void tagApplicableDocument(HandlerDoc doc, InputStream document, ParseState parseState) throws ImporterHandlerException
Specified by: tagApplicableDocument in class AbstractDocumentTagger
Throws: ImporterHandlerException
@Deprecated public String getFieldsRegex()
Deprecated. Since 3.0.0, use getFieldMatcher().
@Deprecated public void setFieldsRegex(String fieldsRegex)
Deprecated. Since 3.0.0, use setFieldMatcher(TextMatcher).
Parameters: fieldsRegex - regular expression
public TextMatcher getFieldMatcher()
public void setFieldMatcher(TextMatcher fieldMatcher)
Parameters: fieldMatcher - field matcher
public String getTargetCharset()
public void setTargetCharset(String targetCharset)
public String getSourceCharset()
public void setSourceCharset(String sourceCharset)
protected void loadHandlerFromXML(XML xml)
Description copied from class: AbstractImporterHandler
Specified by: loadHandlerFromXML in class AbstractImporterHandler
Parameters: xml - XML configuration
protected void saveHandlerToXML(XML xml)
Description copied from class: AbstractImporterHandler
Specified by: saveHandlerToXML in class AbstractImporterHandler
Parameters: xml - the XML
public boolean equals(Object other)
Overrides: equals in class AbstractImporterHandler
public int hashCode()
Overrides: hashCode in class AbstractImporterHandler
public String toString()
Overrides: toString in class AbstractImporterHandler
Copyright © 2009–2023 Norconex Inc. All rights reserved.