Package com.norconex.importer.util
Class CharsetUtil
- java.lang.Object
-
- com.norconex.importer.util.CharsetUtil
-
public final class CharsetUtil extends Object
Character set utility methods.
- Since:
- 2.5.0
- Author:
- Pascal Essiembre
Method Summary
Modifier and Type, Method, Description:
- static void convertCharset(InputStream input, String inputCharset, OutputStream output, String outputCharset): Converts the character encoding of the supplied input.
- static String convertCharset(String input, String inputCharset, String outputCharset): Converts the character encoding of the supplied input value.
- static String detectCharset(InputStream input): Detects the character encoding of an input stream.
- static String detectCharset(InputStream input, String declaredEncoding): Detects the character encoding of an input stream.
- static String detectCharset(String input): Detects the character encoding of a string.
- static String detectCharset(String input, String declaredEncoding): Detects the character encoding of a string.
- static String detectCharsetIfBlank(String charset, Doc doc): Detects a document character encoding if the supplied charset is blank.
- static String detectCharsetIfBlank(String charset, InputStream is): Detects a document character encoding if the supplied charset is blank.
- static String detectsCharset(Doc doc): Detects a document character encoding.
- static String firstNonBlankOrUTF8(ParseState parseState, String... charsets): Returns the first non-blank character encoding, or returns UTF-8 if they are all blank or in post-parse state.
- static String firstNonBlankOrUTF8(String... charsets): Returns the first non-blank character encoding, or returns UTF-8 if they are all blank.
Method Detail
-
convertCharset
public static String convertCharset(String input, String inputCharset, String outputCharset) throws IOException
Converts the character encoding of the supplied input value.
- Parameters:
input - input value to convert
inputCharset - character set of the input value
outputCharset - desired character set of the output value
- Returns:
- the converted value
- Throws:
IOException - problem converting character set
-
convertCharset
public static void convertCharset(InputStream input, String inputCharset, OutputStream output, String outputCharset) throws IOException
Converts the character encoding of the supplied input.
- Parameters:
input - input stream to convert
inputCharset - character set of the input stream
output - where the converted stream will be written
outputCharset - desired character set of the output stream
- Throws:
IOException - problem converting character set
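As an illustration of what stream-level conversion involves, the following is a minimal sketch using only the JDK. It is not the CharsetUtil implementation; the class name `TranscodeSketch` and the method `transcode` are hypothetical, and the sketch simply decodes bytes with one charset and re-encodes them with another.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class TranscodeSketch {

    // Hypothetical helper (an assumption, not the library's code): decodes
    // the input bytes using inputCharset and re-encodes the characters
    // using outputCharset.
    static void transcode(InputStream input, String inputCharset,
            OutputStream output, String outputCharset) throws IOException {
        Reader reader = new InputStreamReader(input, inputCharset);
        Writer writer = new OutputStreamWriter(output, outputCharset);
        char[] buffer = new char[4096];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, read);
        }
        // Flush so buffered characters reach the underlying stream.
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // "caf\u00e9" is 4 bytes in ISO-8859-1; the accented letter takes
        // 2 bytes once re-encoded as UTF-8.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(latin1), "ISO-8859-1", out, "UTF-8");
        System.out.println(latin1.length + " bytes in, " + out.size() + " bytes out");
    }
}
```

An unsupported charset name surfaces as an IOException subclass (UnsupportedEncodingException), which is consistent with the documented throws clause.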
-
detectCharset
public static String detectCharset(String input) throws IOException
Detects the character encoding of a string.
- Parameters:
input - the input to detect encoding on
- Returns:
- the character encoding official name, or null if the input is null or blank
- Throws:
IOException - if there is a problem finding the character encoding
-
detectCharset
public static String detectCharset(String input, String declaredEncoding)
Detects the character encoding of a string. If the string has a declared character encoding, specifying it will influence the detection result.
- Parameters:
input - the input to detect encoding on
declaredEncoding - declared input encoding, if known
- Returns:
- the character encoding official name, or null if the input is null or blank
-
detectCharset
public static String detectCharset(InputStream input) throws IOException
Detects the character encoding of an input stream. InputStream.markSupported() must return true; otherwise, no decoding will be attempted.
- Parameters:
input - the input to detect encoding on
- Returns:
- the character encoding official name, or null if the input is null
- Throws:
IOException - if there is a problem finding the character encoding
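Because detection requires a stream that supports mark/reset, callers holding an arbitrary stream may want to wrap it first. The sketch below uses only the JDK; `MarkSupportSketch`, `ensureMarkSupported`, and `peek` are hypothetical names introduced for illustration, not part of CharsetUtil.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MarkSupportSketch {

    // Hypothetical helper: returns the stream unchanged when it already
    // supports mark/reset, otherwise wraps it in a BufferedInputStream.
    static InputStream ensureMarkSupported(InputStream input) {
        if (input == null || input.markSupported()) {
            return input;
        }
        return new BufferedInputStream(input);
    }

    // Reads up to `limit` bytes for inspection, then rewinds the stream so
    // a later consumer still sees the content from the start.
    // Assumes the stream supports mark/reset.
    static byte[] peek(InputStream input, int limit) throws IOException {
        input.mark(limit);
        byte[] head = new byte[limit];
        int total = 0;
        int read;
        while (total < limit
                && (read = input.read(head, total, limit - total)) != -1) {
            total += read;
        }
        input.reset();
        return Arrays.copyOf(head, total);
    }

    public static void main(String[] args) throws IOException {
        // A stream that reports no mark support, for demonstration only.
        InputStream raw = new FilterInputStream(new ByteArrayInputStream(
                "hello".getBytes(StandardCharsets.US_ASCII))) {
            @Override
            public boolean markSupported() {
                return false;
            }
        };
        InputStream stream = ensureMarkSupported(raw);
        System.out.println("mark supported: " + stream.markSupported());
        System.out.println("peeked: "
                + new String(peek(stream, 2), StandardCharsets.US_ASCII));
        System.out.println("next byte after reset: " + (char) stream.read());
    }
}
```

Wrapping only when needed avoids an extra buffering layer for streams such as ByteArrayInputStream that already support mark/reset.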
-
detectCharset
public static String detectCharset(InputStream input, String declaredEncoding) throws IOException
Detects the character encoding of an input stream. If the input has a declared character encoding, specifying it will influence the detection result. InputStream.markSupported() must return true; otherwise, no decoding will be attempted.
- Parameters:
input - the input to detect encoding on
declaredEncoding - declared input encoding, if known
- Returns:
- the character encoding official name, or null if the input is null
- Throws:
IOException - if there is a problem finding the character encoding
-
detectsCharset
public static String detectsCharset(Doc doc) throws IOException
Detects a document character encoding. It first checks whether one is defined in the document's DocInfo.getContentEncoding(). If not, it attempts to detect it from the document input stream. This method will NOT set the detected encoding on the DocInfo. If unable to detect, UTF-8 is assumed.
- Parameters:
doc - document to detect encoding on
- Returns:
- string representation of character encoding
- Throws:
IOException - problem detecting charset
- Since:
- 3.0.0
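The resolution order described above (declared encoding, then detection, then UTF-8) can be sketched with plain JDK types. This is only an illustration under stated assumptions: `CharsetFallbackSketch` and `resolve` are hypothetical, `declared` stands in for the value of DocInfo.getContentEncoding(), and `detector` stands in for detection on the document stream (the real methods may throw IOException, which a Supplier cannot).

```java
import java.util.function.Supplier;

class CharsetFallbackSketch {

    // Assumption: "blank" means null, empty, or whitespace only.
    static boolean isBlank(String value) {
        return value == null || value.trim().isEmpty();
    }

    // Hypothetical helper: prefers the declared encoding, then the detected
    // one, and finally falls back to UTF-8. The detector is only invoked
    // when no encoding was declared.
    static String resolve(String declared, Supplier<String> detector) {
        if (!isBlank(declared)) {
            return declared;
        }
        String detected = detector.get();
        return isBlank(detected) ? "UTF-8" : detected;
    }

    public static void main(String[] args) {
        System.out.println(resolve("ISO-8859-1", () -> "windows-1252"));
        System.out.println(resolve(null, () -> "windows-1252"));
        System.out.println(resolve(null, () -> null));
    }
}
```

Like the documented method, the sketch returns a value without storing it anywhere; recording the result on the document is left to the caller.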
-
detectCharsetIfBlank
public static String detectCharsetIfBlank(String charset, Doc doc) throws IOException
Detects a document character encoding if the supplied charset is blank. When blank, it checks whether one is defined in the document's DocInfo.getContentEncoding(). If not, it attempts to detect it from the document input stream. This method will NOT set the detected encoding on the DocInfo. If unable to detect, UTF-8 is assumed.
- Parameters:
charset - character encoding to use if not blank
doc - document to detect encoding on
- Returns:
- supplied charset if not blank, or the detected charset
- Throws:
IOException - problem detecting charset
- Since:
- 3.0.0
-
detectCharsetIfBlank
public static String detectCharsetIfBlank(String charset, InputStream is) throws IOException
Detects a document character encoding if the supplied charset is blank. When blank, it will attempt to detect it from the input stream. If unable to detect, UTF-8 is assumed.
- Parameters:
charset - character encoding to use if not blank
is - input stream
- Returns:
- supplied charset if not blank, or the detected charset
- Throws:
IOException - problem detecting charset
- Since:
- 3.0.0
-
firstNonBlankOrUTF8
public static String firstNonBlankOrUTF8(String... charsets)
Returns the first non-blank character encoding, or returns UTF-8 if they are all blank.
- Parameters:
charsets - character encodings to test
- Returns:
- first non-blank, or UTF-8
- Since:
- 3.0.0
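The documented behavior can be approximated in a few lines of plain Java. This is a sketch, not the library's code: `FirstNonBlankSketch` and `firstNonBlankOrUtf8` are hypothetical names, and treating whitespace-only values as blank is an assumption.

```java
class FirstNonBlankSketch {

    // Hypothetical re-creation of the documented contract: returns the
    // first value that is not null, empty, or whitespace only, and falls
    // back to "UTF-8" when there is none.
    static String firstNonBlankOrUtf8(String... charsets) {
        if (charsets != null) {
            for (String charset : charsets) {
                if (charset != null && !charset.trim().isEmpty()) {
                    return charset;
                }
            }
        }
        return "UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(firstNonBlankOrUtf8(null, "", "ISO-8859-1", "UTF-16"));
        System.out.println(firstNonBlankOrUtf8("", "   "));
    }
}
```

A typical use is to list candidate encodings from most to least trusted, for example an explicitly configured value followed by one read from document metadata.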
-
firstNonBlankOrUTF8
public static String firstNonBlankOrUTF8(ParseState parseState, String... charsets)
Returns the first non-blank character encoding, or returns UTF-8 if they are all blank or in post-parse state. That is, UTF-8 is always returned if parsing has already occurred (since parsing converts content encoding to UTF-8).
- Parameters:
parseState - document parsing state
charsets - character encodings to test
- Returns:
- first non-blank, or UTF-8
- Since:
- 3.0.0
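The parse-state rule can be illustrated the same way. In this sketch the nested `ParseState` enum with `PRE` and `POST` values is a stand-in defined locally for illustration (an assumption, not the library's type), and `ParseStateCharsetSketch` is a hypothetical class.

```java
class ParseStateCharsetSketch {

    // Stand-in for the library's parse state type; the names are assumed.
    enum ParseState { PRE, POST }

    // After parsing, content is documented to be UTF-8, so the candidate
    // list is ignored. Before parsing, the first non-blank candidate wins.
    static String firstNonBlankOrUtf8(ParseState state, String... charsets) {
        if (state == ParseState.POST || charsets == null) {
            return "UTF-8";
        }
        for (String charset : charsets) {
            if (charset != null && !charset.trim().isEmpty()) {
                return charset;
            }
        }
        return "UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(firstNonBlankOrUtf8(ParseState.PRE, "", "ISO-8859-1"));
        System.out.println(firstNonBlankOrUtf8(ParseState.POST, "ISO-8859-1"));
    }
}
```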