Class CharsetTransformer

  • All Implemented Interfaces:
    IXMLConfigurable, IImporterHandler, IDocumentTransformer

    public class CharsetTransformer
    extends AbstractDocumentTransformer
    implements IXMLConfigurable

    Transforms document content (if needed) from a source character encoding (charset) to a target one. Both the source and target character encodings are optional. If no source character encoding is explicitly provided, the transformer first tries to detect the encoding of the document content before converting it to the target encoding. If the source character encoding cannot be established, the content encoding remains unchanged. When no target character encoding is specified, UTF-8 is assumed.
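
    The rules above can be illustrated with a minimal sketch that uses only the JDK. It is not the implementation of this class: the class name CharsetConversionSketch and the convert helper are hypothetical, and encoding detection is left out, so a null source stands in for "the source encoding could not be established".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetConversionSketch {

    // Hypothetical helper, not part of the Norconex API.
    // A null source models an encoding that could not be established;
    // a null target falls back to UTF-8.
    public static byte[] convert(byte[] content, Charset source, Charset target) {
        Charset effectiveTarget = (target != null) ? target : StandardCharsets.UTF_8;
        if (source == null || source.equals(effectiveTarget)) {
            // Unknown source or nothing to do: leave the content unchanged.
            return content;
        }
        // Decode the bytes with the source charset, then encode with the target.
        String text = new String(content, source);
        return text.getBytes(effectiveTarget);
    }

    public static void main(String[] args) {
        // "café" is 4 bytes in ISO-8859-1 and 5 bytes in UTF-8.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = convert(latin1, StandardCharsets.ISO_8859_1, null);
        System.out.println(latin1.length + " -> " + utf8.length); // prints "4 -> 5"
    }
}
```

    In this sketch, leaving the target as null produces UTF-8, mirroring the default described above.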

    Should I use this transformer?

    Before using this transformer, be aware that when the importer parses documents with the default document parser factory, it tries to convert and return content as UTF-8 (for most, if not all, content-types). If UTF-8 is your desired target, it only makes sense to use this transformer as a pre-parsing handler (for text content-types only) when it is important to work with a specific character encoding before parsing. If, on the other hand, you wish to convert to a target character encoding other than UTF-8, you can use this transformer as a post-parsing handler to do so.

    Conversion is not flawless

    Because character encoding detection is not always accurate and because documents sometimes mix different encodings, there is no guarantee this class will handle ALL character encoding conversions properly.
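
    As an illustration of what an inaccurate detection can lead to, the following JDK-only sketch converts UTF-8 content while wrongly assuming the source is ISO-8859-1. The class name MisdetectionSketch and the toUtf8 helper are hypothetical and do not reflect how this class detects or converts encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MisdetectionSketch {

    // Hypothetical helper: decodes with the assumed source charset,
    // then re-encodes the resulting text as UTF-8.
    public static byte[] toUtf8(byte[] content, Charset assumedSource) {
        return new String(content, assumedSource).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "é" encoded as UTF-8 is the two bytes C3 A9.
        byte[] original = "\u00e9".getBytes(StandardCharsets.UTF_8);

        // Wrong guess: each byte is treated as a separate ISO-8859-1 character.
        byte[] converted = toUtf8(original, StandardCharsets.ISO_8859_1);

        // The single character has become two unrelated characters (U+00C3, U+00A9).
        String seen = new String(converted, StandardCharsets.UTF_8);
        System.out.println(seen.equals("\u00c3\u00a9")); // prints "true"
        System.out.println(converted.length);            // prints "4"
    }
}
```

    When the source encoding is known in advance, setting sourceCharset explicitly avoids relying on detection.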

    XML configuration usage:

    
    <handler
        class="com.norconex.importer.handler.transformer.impl.CharsetTransformer"
        sourceCharset="(character encoding)"
        targetCharset="(character encoding)">
      <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
      <restrictTo>
        <fieldMatcher>(field-matching expression)</fieldMatcher>
        <valueMatcher>(value-matching expression)</valueMatcher>
      </restrictTo>
    </handler>

    XML usage example:

    
    <handler
        class="CharsetTransformer"
        sourceCharset="ISO-8859-1"
        targetCharset="UTF-8"/>

    The above example converts the content of a document from "ISO-8859-1" to "UTF-8".

    Since:
    2.5.0
    Author:
    Pascal Essiembre
    See Also:
    CharsetTagger