public class CharsetTransformer extends AbstractDocumentTransformer implements IXMLConfigurable
Transforms document content (if needed) from a source character encoding (charset) to a target one. Both the source and target character encodings are optional. If no source character encoding is explicitly provided, the transformer first tries to detect the encoding of the document content before converting it to the target encoding. If the source character encoding cannot be established, the content encoding remains unchanged. When no target character encoding is specified, UTF-8 is assumed.
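The conversion rules described above can be illustrated with a minimal sketch that uses only the JDK. It is not the transformer's actual implementation: the class and method names are hypothetical, encoding detection is left out, and a blank source charset stands in for "the source encoding could not be established".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of the Norconex Importer API.
final class CharsetConversionSketch {

    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    // Re-encodes input from sourceCharset to targetCharset.
    // Assumption: a blank sourceCharset means the encoding is unknown,
    // in which case the bytes are copied unchanged.
    static void convert(InputStream input, OutputStream output,
            String sourceCharset, String targetCharset) throws IOException {
        if (isBlank(sourceCharset)) {
            byte[] bytes = new byte[8192];
            int count;
            while ((count = input.read(bytes)) != -1) {
                output.write(bytes, 0, count);
            }
            output.flush();
            return;
        }
        Charset source = Charset.forName(sourceCharset);
        // A blank target falls back to UTF-8.
        Charset target = isBlank(targetCharset)
                ? StandardCharsets.UTF_8 : Charset.forName(targetCharset);
        Reader reader = new InputStreamReader(input, source);
        Writer writer = new OutputStreamWriter(output, target);
        char[] chars = new char[8192];
        int count;
        while ((count = reader.read(chars)) != -1) {
            writer.write(chars, 0, count);
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // "café" encoded as ISO-8859-1: the last character is the single byte 0xE9.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        convert(new ByteArrayInputStream(latin1), out, "ISO-8859-1", null);
        System.out.println(latin1.length + " bytes in, "
                + out.size() + " bytes out");
    }
}
```

Decoding to characters and re-encoding through a `Reader`/`Writer` pair keeps memory use constant for large documents, at the cost of silently replacing any bytes that are invalid in the source charset.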
Before using this transformer, note that when the importer parses documents with the default document parser factory, it tries to convert and return content as UTF-8 (for most, if not all, content types). If UTF-8 is your desired target, it only makes sense to use this transformer as a pre-parsing handler (for text content types only), when it is important to work with a specific character encoding before parsing. If, on the other hand, you wish to convert to a character encoding other than UTF-8, you can use this transformer as a post-parsing handler to do so.
Because character encoding detection is not always accurate, and because documents sometimes mix different encodings, there is no guarantee this class will handle ALL character encoding conversions properly.
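To see why a wrong guess about the source encoding can go unnoticed, consider the following sketch. It uses only the JDK's strict `CharsetDecoder` mode and is not how this class detects encodings; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of the Norconex Importer API.
final class StrictDecodeSketch {

    // Returns true when every byte sequence in content is valid for charset.
    static boolean decodesCleanly(byte[] content, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(content));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // ISO-8859-1 bytes are rejected as UTF-8: 0xE9 starts an incomplete sequence.
        System.out.println(decodesCleanly(latin1, StandardCharsets.UTF_8));
        // UTF-8 bytes are accepted as ISO-8859-1, because every byte value
        // maps to some ISO-8859-1 character. The mistake is silent.
        System.out.println(decodesCleanly(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

A strict decode can rule some encodings out, but it cannot confirm one: single-byte charsets such as ISO-8859-1 accept any input, so the converted text may still be wrong.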
<transformer class="com.norconex.importer.handler.transformer.impl.CharsetTransformer"
    sourceCharset="(character encoding)" targetCharset="(character encoding)">
  <restrictTo caseSensitive="[false|true]"
      field="(name of header/metadata field name to match)">
    (regular expression of value to match)
  </restrictTo>
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
</transformer>
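The "only one needs to match" rule for restrictTo tags can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the library's code: the class names are hypothetical, metadata is simplified to one value per field, the regular expression is assumed to match the whole value, and a handler with no restrictions is assumed to apply to every document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Norconex Importer API.
final class RestrictToSketch {

    // One restrictTo entry: a field name plus a regular expression.
    static final class Restriction {
        final String field;
        final Pattern pattern;

        Restriction(String field, String regex, boolean caseSensitive) {
            this.field = field;
            this.pattern = Pattern.compile(
                    regex, caseSensitive ? 0 : Pattern.CASE_INSENSITIVE);
        }

        // Assumption: the expression must match the entire field value.
        boolean matches(Map<String, String> metadata) {
            String value = metadata.get(field);
            return value != null && pattern.matcher(value).matches();
        }
    }

    // Assumption: no restrictions means the handler applies to all documents.
    // Otherwise a single matching restriction is enough.
    static boolean isApplicable(
            List<Restriction> restrictions, Map<String, String> metadata) {
        if (restrictions.isEmpty()) {
            return true;
        }
        for (Restriction restriction : restrictions) {
            if (restriction.matches(metadata)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Field names and values below are illustrative.
        List<Restriction> restrictions = new ArrayList<>();
        restrictions.add(new Restriction("document.contentType", "text/.*", false));
        restrictions.add(new Restriction("language", "fr", true));

        Map<String, String> metadata = new HashMap<>();
        metadata.put("document.contentType", "TEXT/HTML");
        System.out.println(isApplicable(restrictions, metadata));
    }
}
```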
The following converts the content of a document from "ISO-8859-1" to "UTF-8".
<transformer class="com.norconex.importer.handler.transformer.impl.CharsetTransformer" sourceCharset="ISO-8859-1" targetCharset="UTF-8" />
Modifier and Type | Field and Description |
---|---|
static String | DEFAULT_TARGET_CHARSET |
Constructor and Description |
---|
CharsetTransformer() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
String | getSourceCharset() |
String | getTargetCharset() |
int | hashCode() |
protected void | loadHandlerFromXML(org.apache.commons.configuration.XMLConfiguration xml) Loads configuration settings specific to the implementing class. |
protected void | saveHandlerToXML(EnhancedXMLStreamWriter writer) Saves configuration settings specific to the implementing class. |
void | setSourceCharset(String sourceCharset) |
void | setTargetCharset(String targetCharset) |
String | toString() |
protected void | transformApplicableDocument(String reference, InputStream input, OutputStream output, ImporterMetadata metadata, boolean parsed) |
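Methods inherited from class AbstractDocumentTransformer: transformDocument
Methods inherited from class AbstractImporterHandler: addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
Methods inherited from class Object: clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Methods inherited from interface IXMLConfigurable: loadFromXML, saveToXML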
public static final String DEFAULT_TARGET_CHARSET
protected void transformApplicableDocument(String reference, InputStream input, OutputStream output, ImporterMetadata metadata, boolean parsed) throws ImporterHandlerException
Specified by: transformApplicableDocument in class AbstractDocumentTransformer
Throws: ImporterHandlerException
public String getTargetCharset()
public void setTargetCharset(String targetCharset)
public String getSourceCharset()
public void setSourceCharset(String sourceCharset)
protected void loadHandlerFromXML(org.apache.commons.configuration.XMLConfiguration xml) throws IOException
Description copied from class: AbstractImporterHandler
Loads configuration settings specific to the implementing class.
Specified by: loadHandlerFromXML in class AbstractImporterHandler
Parameters: xml - xml configuration
Throws: IOException - could not load from XML
protected void saveHandlerToXML(EnhancedXMLStreamWriter writer) throws XMLStreamException
Description copied from class: AbstractImporterHandler
Saves configuration settings specific to the implementing class.
Specified by: saveHandlerToXML in class AbstractImporterHandler
Parameters: writer - the xml writer
Throws: XMLStreamException - could not save to XML
public int hashCode()
Overrides: hashCode in class AbstractImporterHandler
public boolean equals(Object other)
Overrides: equals in class AbstractImporterHandler
public String toString()
Overrides: toString in class AbstractImporterHandler
Copyright © 2009–2021 Norconex Inc. All rights reserved.