public class CharsetTagger extends AbstractDocumentTagger implements IXMLConfigurable
Converts one or more field values (if needed) from a source character encoding (charset) to a target one. Both the source and target character encodings are optional. If no source character encoding is explicitly provided, it first tries to detect the encoding of the field values before converting them to the target encoding. If the source character encoding cannot be established, the content encoding will remain unchanged. When no target character encoding is specified, UTF-8 is assumed.
Before using this tagger, note that when the importer parses documents using the default document parser factory, it tries to convert and return fields as UTF-8 (for most, if not all, content types). If UTF-8 is your desired target, it only makes sense to use this tagger as a pre-parsing handler (for text content types only), when it is important to work with a specific character encoding before parsing. If, on the other hand, you wish to convert to a target character encoding other than UTF-8, you can use this tagger as a post-parsing handler to do so.
Because character encoding detection is not always accurate, and because documents sometimes mix different encodings, there is no guarantee this class will handle ALL character encoding conversions properly.
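The conversion rules described above can be illustrated with a minimal sketch that uses only the JDK. This is not the implementation of this class: the class name `CharsetConversionSketch` and the method `convert` are hypothetical, and a `null` source charset stands in for the case where the source encoding cannot be established.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetConversionSketch {

    /** Re-encodes bytes from a source charset to a target charset. */
    static byte[] convert(byte[] input, Charset source, Charset target) {
        // Assumption: a null source means the encoding could not be
        // established, so the content is returned unchanged.
        if (source == null) {
            return input;
        }
        // When no target is given, fall back to UTF-8 (the documented default).
        Charset effectiveTarget =
                (target != null) ? target : StandardCharsets.UTF_8;
        // Decode with the source charset, then encode with the target one.
        String decoded = new String(input, source);
        return decoded.getBytes(effectiveTarget);
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the final byte 0xE9 is "é".
        byte[] latin1 = {(byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9};
        byte[] utf8 = convert(latin1, StandardCharsets.ISO_8859_1, null);
        // In UTF-8, "é" takes two bytes (0xC3 0xA9).
        System.out.println(Arrays.toString(utf8)); // [99, 97, 102, -61, -87]
    }
}
```

If the wrong source charset is supplied or detected, the decode step silently produces the wrong characters, which is why the conversion cannot be guaranteed for every document.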
<tagger class="com.norconex.importer.handler.tagger.impl.CharsetTagger"
    sourceCharset="(character encoding)" targetCharset="(character encoding)">
  <restrictTo caseSensitive="[false|true]"
      field="(name of header/metadata field name to match)">
    (regular expression of value to match)
  </restrictTo>
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <fieldsRegex>(regex matching fields to detect encoding)</fieldsRegex>
</tagger>
Converts the characters of a "description" field from "ISO-8859-1" to "UTF-8".
<tagger class="com.norconex.importer.handler.tagger.impl.CharsetTagger"
    sourceCharset="ISO-8859-1" targetCharset="UTF-8">
  <fieldsRegex>description</fieldsRegex>
</tagger>
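To show how a `fieldsRegex` value might select fields by name, here is a minimal sketch using only `java.util.regex`. The class `FieldsRegexSketch` and the method `selectFields` are hypothetical, and the sketch assumes the expression must match the whole field name; the actual matching semantics of this tagger are not stated on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class FieldsRegexSketch {

    /** Returns the field names that fully match the given regular expression. */
    static List<String> selectFields(List<String> fieldNames, String fieldsRegex) {
        Pattern pattern = Pattern.compile(fieldsRegex);
        List<String> selected = new ArrayList<>();
        for (String name : fieldNames) {
            // Assumption: the whole field name must match, not just part of it.
            if (pattern.matcher(name).matches()) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "title", "description", "dc.description", "keywords");
        // Plain name: selects only the field called "description".
        System.out.println(selectFields(names, "description"));
        // prints [description]
        // Optional prefix: selects both description variants.
        System.out.println(selectFields(names, "(dc\\.)?description"));
        // prints [description, dc.description]
    }
}
```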
Modifier and Type | Field and Description |
---|---|
static String | DEFAULT_TARGET_CHARSET |
Constructor and Description |
---|
CharsetTagger() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
String | getFieldsRegex() |
String | getSourceCharset() |
String | getTargetCharset() |
int | hashCode() |
protected void | loadHandlerFromXML(org.apache.commons.configuration.XMLConfiguration xml) Loads configuration settings specific to the implementing class. |
protected void | saveHandlerToXML(EnhancedXMLStreamWriter writer) Saves configuration settings specific to the implementing class. |
void | setFieldsRegex(String fieldsRegex) |
void | setSourceCharset(String sourceCharset) |
void | setTargetCharset(String targetCharset) |
protected void | tagApplicableDocument(String reference, InputStream document, ImporterMetadata metadata, boolean parsed) |
String | toString() |
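Methods inherited from class AbstractDocumentTagger: tagDocument
Methods inherited from class AbstractImporterHandler: addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
Methods inherited from class java.lang.Object: clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Methods inherited from interface IXMLConfigurable: loadFromXML, saveToXML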
public static final String DEFAULT_TARGET_CHARSET
protected void tagApplicableDocument(String reference, InputStream document, ImporterMetadata metadata, boolean parsed) throws ImporterHandlerException
Specified by: tagApplicableDocument in class AbstractDocumentTagger
Throws: ImporterHandlerException
public String getFieldsRegex()
public void setFieldsRegex(String fieldsRegex)
public String getTargetCharset()
public void setTargetCharset(String targetCharset)
public String getSourceCharset()
public void setSourceCharset(String sourceCharset)
protected void loadHandlerFromXML(org.apache.commons.configuration.XMLConfiguration xml) throws IOException
Description copied from class: AbstractImporterHandler
Loads configuration settings specific to the implementing class.
Specified by: loadHandlerFromXML in class AbstractImporterHandler
Parameters: xml - xml configuration
Throws: IOException - could not load from XML
protected void saveHandlerToXML(EnhancedXMLStreamWriter writer) throws XMLStreamException
Description copied from class: AbstractImporterHandler
Saves configuration settings specific to the implementing class.
Specified by: saveHandlerToXML in class AbstractImporterHandler
Parameters: writer - the xml writer
Throws: XMLStreamException - could not save to XML
public int hashCode()
Overrides: hashCode in class AbstractImporterHandler
public boolean equals(Object other)
Overrides: equals in class AbstractImporterHandler
public String toString()
Overrides: toString in class AbstractImporterHandler
Copyright © 2009–2021 Norconex Inc. All rights reserved.