public class CharsetTransformer extends AbstractDocumentTransformer implements IXMLConfigurable
Transforms document content (if needed) from a source character encoding (charset) to a target one. Both the source and target character encodings are optional. If no source character encoding is explicitly provided, the transformer first tries to detect the encoding of the document content before converting it to the target encoding. If the source character encoding cannot be established, the content encoding remains unchanged. When no target character encoding is specified, UTF-8 is assumed.
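The rules above (optional source, target defaulting to UTF-8, content left unchanged when the source is unknown) can be sketched with plain JDK classes. This is only an illustration of the described behaviour, not the transformer's actual implementation: the class `CharsetConversionSketch` and its methods are hypothetical, and encoding detection is assumed to have happened elsewhere, with a `null` source standing in for "could not be established".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the Norconex Importer API.
final class CharsetConversionSketch {

    private CharsetConversionSketch() {
    }

    // Converts the stream from "source" to "target".
    // Assumption: a null source means the encoding could not be established,
    // in which case the bytes are copied through unchanged.
    // Assumption: a null target falls back to UTF-8.
    static void convert(InputStream input, OutputStream output,
            Charset source, Charset target) throws IOException {
        if (source == null) {
            byte[] bytes = new byte[4096];
            int count;
            while ((count = input.read(bytes)) != -1) {
                output.write(bytes, 0, count);
            }
            output.flush();
            return;
        }
        Charset effectiveTarget =
                (target != null) ? target : StandardCharsets.UTF_8;
        // InputStreamReader substitutes a replacement character for
        // malformed input rather than throwing.
        Reader reader = new InputStreamReader(input, source);
        Writer writer = new OutputStreamWriter(output, effectiveTarget);
        char[] chars = new char[4096];
        int count;
        while ((count = reader.read(chars)) != -1) {
            writer.write(chars, 0, count);
        }
        writer.flush();
    }

    // Convenience wrapper for in-memory content.
    static byte[] convertBytes(byte[] content, Charset source, Charset target) {
        ByteArrayOutputStream output = new ByteArrayOutputStream();
        try {
            convert(new ByteArrayInputStream(content), output, source, target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return output.toByteArray();
    }
}
```

Calling `convertBytes(content, StandardCharsets.ISO_8859_1, null)` re-encodes ISO-8859-1 content as UTF-8, while passing a `null` source returns the bytes as they were.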
Before using this transformer, note that when the importer parses documents using the default document parser factory, it tries to convert and return content as UTF-8 (for most, if not all, content types). If UTF-8 is your desired target, it only makes sense to use this transformer as a pre-parsing handler (for text content types only) when it is important to work with a specific character encoding before parsing. If, on the other hand, you wish to convert to a target character encoding other than UTF-8, you can use this transformer as a post-parsing handler.
Because character encoding detection is not always accurate and because documents sometimes mix different encodings, there is no guarantee this class will handle all character encoding conversions properly.
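To see why a wrong guess about the source encoding matters, the following sketch uses only JDK classes to decode the same bytes under two charsets. The class `CharsetMismatchSketch` and its methods are hypothetical and are not part of this library. The point is that ISO-8859-1 accepts every byte sequence, so UTF-8 content misread as ISO-8859-1 decodes without error but yields the wrong characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper for illustration only; not part of the Norconex Importer API.
final class CharsetMismatchSketch {

    private CharsetMismatchSketch() {
    }

    // Decodes bytes with the given charset; malformed input is silently
    // replaced, which is how a wrong guess can go unnoticed.
    static String decodeLeniently(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Returns true when the bytes form a valid sequence in the given charset.
    // A true result does not prove the charset is the one the author used.
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For instance, the UTF-8 bytes of "é" (0xC3 0xA9) are valid ISO-8859-1 and decode there as the two characters "Ã©", whereas the single ISO-8859-1 byte 0xE9 is not valid UTF-8 on its own.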
<handler
    class="com.norconex.importer.handler.transformer.impl.CharsetTransformer"
    sourceCharset="(character encoding)"
    targetCharset="(character encoding)">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
</handler>
<handler
    class="CharsetTransformer"
    sourceCharset="ISO-8859-1"
    targetCharset="UTF-8"/>
The above example converts the content of a document from "ISO-8859-1" to "UTF-8".
See Also: CharsetTagger
Modifier and Type | Field and Description |
---|---|
static String | DEFAULT_TARGET_CHARSET |
Constructor and Description |
---|
CharsetTransformer() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
String | getSourceCharset() |
String | getTargetCharset() |
int | hashCode() |
protected void | loadHandlerFromXML(XML xml) Loads configuration settings specific to the implementing class. |
protected void | saveHandlerToXML(XML xml) Saves configuration settings specific to the implementing class. |
void | setSourceCharset(String sourceCharset) |
void | setTargetCharset(String targetCharset) |
String | toString() |
protected void | transformApplicableDocument(HandlerDoc doc, InputStream input, OutputStream output, ParseState parseState) |
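Methods inherited from class AbstractDocumentTransformer: transformDocument
Methods inherited from class AbstractImporterHandler: addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
Methods inherited from class Object: clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Methods inherited from interface IXMLConfigurable: loadFromXML, saveToXML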
public static final String DEFAULT_TARGET_CHARSET
protected void transformApplicableDocument(HandlerDoc doc, InputStream input, OutputStream output, ParseState parseState) throws ImporterHandlerException
Specified by: transformApplicableDocument in class AbstractDocumentTransformer
Throws: ImporterHandlerException
public String getTargetCharset()
public void setTargetCharset(String targetCharset)
public String getSourceCharset()
public void setSourceCharset(String sourceCharset)
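protected void loadHandlerFromXML(XML xml)
Description copied from class: AbstractImporterHandler
Loads configuration settings specific to the implementing class.
Specified by: loadHandlerFromXML in class AbstractImporterHandler
Parameters: xml - XML configuration
protected void saveHandlerToXML(XML xml)
Description copied from class: AbstractImporterHandler
Saves configuration settings specific to the implementing class.
Specified by: saveHandlerToXML in class AbstractImporterHandler
Parameters: xml - the XML
public boolean equals(Object other)
Overrides: equals in class AbstractImporterHandler
public int hashCode()
Overrides: hashCode in class AbstractImporterHandler
public String toString()
Overrides: toString in class AbstractImporterHandler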
Copyright © 2009–2023 Norconex Inc. All rights reserved.