Class CharsetTransformer
- java.lang.Object
  - com.norconex.importer.handler.AbstractImporterHandler
    - com.norconex.importer.handler.transformer.AbstractDocumentTransformer
      - com.norconex.importer.handler.transformer.impl.CharsetTransformer
- All Implemented Interfaces:
IXMLConfigurable, IImporterHandler, IDocumentTransformer
public class CharsetTransformer extends AbstractDocumentTransformer implements IXMLConfigurable
Transforms document content (if needed) from a source character encoding (charset) to a target one. Both the source and target character encodings are optional. If no source character encoding is explicitly provided, the transformer first tries to detect the encoding of the document content before converting it to the target encoding. If the source character encoding cannot be established, the content encoding remains unchanged. When no target character encoding is specified, UTF-8 is assumed.
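The conversion rules above can be illustrated with a minimal, JDK-only sketch. It is not the Norconex implementation: the class TranscodeSketch and its transcode method are hypothetical names introduced here, and charset detection is left out entirely, so a blank source charset is treated as "could not be established" and the bytes are copied unchanged.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the Norconex Importer API.
public class TranscodeSketch {

    /**
     * Re-encodes the input from sourceCharset to targetCharset.
     * Assumption: a null or blank source means the encoding is unknown,
     * in which case the content is copied as-is. A null or blank target
     * falls back to UTF-8.
     */
    public static void transcode(InputStream input, OutputStream output,
            String sourceCharset, String targetCharset) throws IOException {
        if (sourceCharset == null || sourceCharset.trim().isEmpty()) {
            // Source encoding unknown: leave the bytes untouched.
            byte[] bytes = new byte[4096];
            int n;
            while ((n = input.read(bytes)) != -1) {
                output.write(bytes, 0, n);
            }
            return;
        }
        Charset source = Charset.forName(sourceCharset);
        Charset target = (targetCharset == null || targetCharset.trim().isEmpty())
                ? StandardCharsets.UTF_8
                : Charset.forName(targetCharset);

        // Decode with the source charset, encode with the target charset.
        Reader reader = new InputStreamReader(input, source);
        Writer writer = new OutputStreamWriter(output, target);
        char[] chars = new char[4096];
        int n;
        while ((n = reader.read(chars)) != -1) {
            writer.write(chars, 0, n);
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // "café" is 4 bytes in ISO-8859-1.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(latin1), out, "ISO-8859-1", null);
        // Prints 5: the accented character takes two bytes in UTF-8.
        System.out.println(out.size());
    }
}
```

Decoding to characters and re-encoding, rather than mapping bytes directly, keeps the sketch independent of any particular pair of encodings.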
Should I use this transformer?
Before using this transformer, note that when the importer parses documents with the default document parser factory, it tries to convert content and return it as UTF-8 (for most, if not all, content-types). If UTF-8 is your desired target, it only makes sense to use this transformer as a pre-parsing handler (for text content-types only), when it is important to work with a specific character encoding before parsing. If, on the other hand, you wish to convert to a target character encoding other than UTF-8, you can use this transformer as a post-parsing handler to do so.
Conversion is not flawless
Because character encoding detection is not always accurate, and because documents sometimes mix different encodings, there is no guarantee this class will handle all character encoding conversions properly.
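The following JDK-only sketch suggests why a wrong source encoding silently corrupts content. The class MisdetectionSketch and its methods are hypothetical names used for illustration only; they are not part of the Norconex API and do not reflect how this class detects encodings.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class, not part of the Norconex Importer API.
public class MisdetectionSketch {

    /** Decodes the bytes with the assumed charset, then re-encodes as UTF-8. */
    public static byte[] toUtf8(byte[] content, Charset assumedSource) {
        return new String(content, assumedSource)
                .getBytes(StandardCharsets.UTF_8);
    }

    /** Returns true if the bytes decode without error in the given charset. */
    public static boolean isValid(byte[] content, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(content));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // UTF-8 bytes wrongly treated as ISO-8859-1: the two bytes of the
        // accented character become two separate characters (U+00C3, U+00A9).
        String garbled = new String(
                toUtf8(utf8, StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
        System.out.println(garbled.equals("caf\u00c3\u00a9")); // true

        // ISO-8859-1 accepts every byte sequence, so validity checks
        // cannot rule it out; strict UTF-8 decoding can reject input.
        System.out.println(isValid(utf8, StandardCharsets.ISO_8859_1)); // true
        System.out.println(isValid(latin1, StandardCharsets.UTF_8));    // false
    }
}
```

Because single-byte encodings such as ISO-8859-1 accept any byte sequence, no error is raised when the guess is wrong; the output is simply different text.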
XML configuration usage:
<handler class="com.norconex.importer.handler.transformer.impl.CharsetTransformer"
    sourceCharset="(character encoding)"
    targetCharset="(character encoding)">

  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher>(field-matching expression)</fieldMatcher>
    <valueMatcher>(value-matching expression)</valueMatcher>
  </restrictTo>
</handler>
XML usage example:
<handler class="CharsetTransformer" sourceCharset="ISO-8859-1" targetCharset="UTF-8"/>
The above example converts the content of a document from "ISO-8859-1" to "UTF-8".
- Since: 2.5.0
- Author: Pascal Essiembre
- See Also: CharsetTagger
-
-
Field Summary
- static String DEFAULT_TARGET_CHARSET
-
Constructor Summary
- CharsetTransformer()
-
Method Summary
- boolean equals(Object other)
- String getSourceCharset()
- String getTargetCharset()
- int hashCode()
- protected void loadHandlerFromXML(XML xml): Loads configuration settings specific to the implementing class.
- protected void saveHandlerToXML(XML xml): Saves configuration settings specific to the implementing class.
- void setSourceCharset(String sourceCharset)
- void setTargetCharset(String targetCharset)
- String toString()
- protected void transformApplicableDocument(HandlerDoc doc, InputStream input, OutputStream output, ParseState parseState)
-
Methods inherited from class com.norconex.importer.handler.transformer.AbstractDocumentTransformer
transformDocument
-
Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
-
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface com.norconex.commons.lang.xml.IXMLConfigurable
loadFromXML, saveToXML
-
Field Detail
-
DEFAULT_TARGET_CHARSET
public static final String DEFAULT_TARGET_CHARSET
-
-
Method Detail
-
transformApplicableDocument
protected void transformApplicableDocument(HandlerDoc doc, InputStream input, OutputStream output, ParseState parseState) throws ImporterHandlerException
- Specified by: transformApplicableDocument in class AbstractDocumentTransformer
- Throws: ImporterHandlerException
-
getTargetCharset
public String getTargetCharset()
-
setTargetCharset
public void setTargetCharset(String targetCharset)
-
getSourceCharset
public String getSourceCharset()
-
setSourceCharset
public void setSourceCharset(String sourceCharset)
-
loadHandlerFromXML
protected void loadHandlerFromXML(XML xml)
Description copied from class: AbstractImporterHandler
Loads configuration settings specific to the implementing class.
- Specified by: loadHandlerFromXML in class AbstractImporterHandler
- Parameters: xml - XML configuration
-
saveHandlerToXML
protected void saveHandlerToXML(XML xml)
Description copied from class: AbstractImporterHandler
Saves configuration settings specific to the implementing class.
- Specified by: saveHandlerToXML in class AbstractImporterHandler
- Parameters: xml - the XML
-
equals
public boolean equals(Object other)
- Overrides: equals in class AbstractImporterHandler
-
hashCode
public int hashCode()
- Overrides: hashCode in class AbstractImporterHandler
-
toString
public String toString()
- Overrides: toString in class AbstractImporterHandler
-
-