Package com.norconex.importer.util
Class CharsetUtil
- java.lang.Object
  - com.norconex.importer.util.CharsetUtil
public final class CharsetUtil extends Object
Character set utility methods.
- Since:
- 2.5.0
- Author:
- Pascal Essiembre
Method Summary
static void convertCharset(InputStream input, String inputCharset, OutputStream output, String outputCharset)
    Converts the character encoding of the supplied input.
static String convertCharset(String input, String inputCharset, String outputCharset)
    Converts the character encoding of the supplied input value.
static String detectCharset(InputStream input)
    Detects the character encoding of an input stream.
static String detectCharset(InputStream input, String declaredEncoding)
    Detects the character encoding of an input stream.
static String detectCharset(String input)
    Detects the character encoding of a string.
static String detectCharset(String input, String declaredEncoding)
    Detects the character encoding of a string.
static String detectCharsetIfBlank(String charset, Doc doc)
    Detects a document character encoding if the supplied charset is blank.
static String detectCharsetIfBlank(String charset, InputStream is)
    Detects a document character encoding if the supplied charset is blank.
static String detectsCharset(Doc doc)
    Detects a document character encoding.
static String firstNonBlankOrUTF8(ParseState parseState, String... charsets)
    Returns the first non-blank character encoding, or returns UTF-8 if they are all blank or in post-parse state.
static String firstNonBlankOrUTF8(String... charsets)
    Returns the first non-blank character encoding, or returns UTF-8 if they are all blank.
Method Detail
-
convertCharset
public static String convertCharset(String input, String inputCharset, String outputCharset) throws IOException
Converts the character encoding of the supplied input value.
- Parameters:
input - input value to convert
inputCharset - character set of the input value
outputCharset - desired character set of the output value
- Returns:
- the converted value
- Throws:
IOException - problem converting character set
-
convertCharset
public static void convertCharset(InputStream input, String inputCharset, OutputStream output, String outputCharset) throws IOException
Converts the character encoding of the supplied input.
- Parameters:
input - input stream to convert
inputCharset - character set of the input stream
output - where the converted stream will be stored
outputCharset - desired character set of the output stream
- Throws:
IOException - problem converting character set
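The stream conversion described above can be illustrated with plain JDK classes. The following is a minimal sketch only, assuming a decode-then-re-encode approach; `TranscodeSketch` and `transcode` are hypothetical names, and this is not presented as how `CharsetUtil.convertCharset` is actually implemented.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper, not part of Norconex Importer.
final class TranscodeSketch {

    private TranscodeSketch() {
    }

    // Decodes the bytes of `input` using `inputCharset` and writes the
    // resulting characters to `output` encoded with `outputCharset`.
    static void transcode(InputStream input, String inputCharset,
            OutputStream output, String outputCharset) throws IOException {
        Reader reader = new InputStreamReader(
                input, Charset.forName(inputCharset));
        Writer writer = new OutputStreamWriter(
                output, Charset.forName(outputCharset));
        char[] buffer = new char[4096];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, read);
        }
        // Flush instead of closing, so the caller keeps ownership
        // of both streams.
        writer.flush();
    }
}
```

Note that `Charset.forName` throws an unchecked exception for unknown or illegal charset names; whether the library reports such cases as an `IOException` is not stated in this page.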
-
detectCharset
public static String detectCharset(String input) throws IOException
Detects the character encoding of a string.
- Parameters:
input - the input to detect encoding on
- Returns:
- the character encoding official name, or null if the input is null or blank
- Throws:
IOException - if there is a problem finding the character encoding
-
detectCharset
public static String detectCharset(String input, String declaredEncoding)
Detects the character encoding of a string. If the string has a declared character encoding, specifying it will influence the detection result.
- Parameters:
input - the input to detect encoding on
declaredEncoding - declared input encoding, if known
- Returns:
- the character encoding official name, or null if the input is null or blank
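To illustrate how a declared encoding can act as a hint, here is a rough sketch that works on raw bytes with JDK classes only. It swaps in a much simpler technique than real detection: it accepts the declared charset if the bytes decode under it without errors, and otherwise falls back to a strict UTF-8 check. `DeclaredHintSketch` and its methods are hypothetical, and the library's actual detection logic is not described in this page.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Norconex Importer.
final class DeclaredHintSketch {

    private DeclaredHintSketch() {
    }

    // Resolves a charset name, or returns null if the name is blank,
    // illegal, or unsupported by this JVM.
    static Charset lookup(String name) {
        if (name == null || name.trim().isEmpty()) {
            return null;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // True if the bytes decode under the charset without malformed
    // or unmappable input being reported.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns the official (canonical) charset name, or null if the
    // input is empty or neither candidate decodes cleanly.
    static String detect(byte[] bytes, String declaredEncoding) {
        if (bytes == null || bytes.length == 0) {
            return null;
        }
        Charset declared = lookup(declaredEncoding);
        if (declared != null && decodesCleanly(bytes, declared)) {
            return declared.name();
        }
        return decodesCleanly(bytes, StandardCharsets.UTF_8)
                ? StandardCharsets.UTF_8.name() : null;
    }
}
```

A single-byte charset such as ISO-8859-1 decodes any byte sequence, so this check alone cannot confirm that a declaration is right; it can only rule out declarations that are clearly wrong.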
-
detectCharset
public static String detectCharset(InputStream input) throws IOException
Detects the character encoding of an input stream. InputStream.markSupported() must return true, otherwise no decoding will be attempted.
- Parameters:
input - the input to detect encoding on
- Returns:
- the character encoding official name, or null if input is null
- Throws:
IOException - if there is a problem finding the character encoding
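The `markSupported()` requirement exists because detection has to read ahead and then rewind, so that the caller can still consume the stream from the start. The sketch below shows that mark/reset pattern with a deliberately simple stand-in for detection (byte-order-mark sniffing). `MarkResetSketch` and `sniffBom` are hypothetical names and do not reflect the library's detection logic.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Norconex Importer.
final class MarkResetSketch {

    private MarkResetSketch() {
    }

    // Peeks at up to three leading bytes, then rewinds the stream.
    // Returns a charset name if a byte-order mark is recognized,
    // or null otherwise (including when mark/reset is unsupported).
    static String sniffBom(InputStream input) throws IOException {
        if (input == null || !input.markSupported()) {
            return null;
        }
        byte[] head = new byte[3];
        int count = 0;
        input.mark(head.length);
        try {
            int read;
            while (count < head.length
                    && (read = input.read(
                            head, count, head.length - count)) != -1) {
                count += read;
            }
        } finally {
            // Rewind so the caller sees the stream from its start.
            input.reset();
        }
        int b0 = head[0] & 0xFF;
        int b1 = head[1] & 0xFF;
        int b2 = head[2] & 0xFF;
        if (count >= 3 && b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            return "UTF-8";
        }
        if (count >= 2 && b0 == 0xFE && b1 == 0xFF) {
            return "UTF-16BE";
        }
        // Simplification: FF FE is also the start of a UTF-32LE mark.
        if (count >= 2 && b0 == 0xFF && b1 == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }
}
```

A stream that does not support marking can be wrapped in a `java.io.BufferedInputStream` before being passed in, since that class supports `mark` and `reset`.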
-
detectCharset
public static String detectCharset(InputStream input, String declaredEncoding) throws IOException
Detects the character encoding of an input stream. If the input stream has a declared character encoding, specifying it will influence the detection result. InputStream.markSupported() must return true, otherwise no decoding will be attempted.
- Parameters:
input - the input to detect encoding on
declaredEncoding - declared input encoding, if known
- Returns:
- the character encoding official name, or null if input is null
- Throws:
IOException - if there is a problem finding the character encoding
-
detectsCharset
public static String detectsCharset(Doc doc) throws IOException
Detects a document character encoding. It first checks if it is defined in the document's DocInfo.getContentEncoding(). If not, it will attempt to detect it from the document input stream. This method will NOT set the detected encoding on the DocInfo. If unable to detect, UTF-8 is assumed.
- Parameters:
doc - document to detect encoding on
- Returns:
- string representation of character encoding
- Throws:
IOException - problem detecting charset
- Since:
- 3.0.0
-
detectCharsetIfBlank
public static String detectCharsetIfBlank(String charset, Doc doc) throws IOException
Detects a document character encoding if the supplied charset is blank. When blank, it checks if it is defined in the document's DocInfo.getContentEncoding(). If not, it will attempt to detect it from the document input stream. This method will NOT set the detected encoding on the DocInfo. If unable to detect, UTF-8 is assumed.
- Parameters:
charset - character encoding to use if not blank
doc - document to detect encoding on
- Returns:
- supplied charset if not blank, or the detected charset
- Throws:
IOException - problem detecting charset
- Since:
- 3.0.0
-
detectCharsetIfBlank
public static String detectCharsetIfBlank(String charset, InputStream is) throws IOException
Detects a document character encoding if the supplied charset is blank. When blank, it will attempt to detect it from the input stream. If unable to detect, UTF-8 is assumed.
- Parameters:
charset - character encoding to use if not blank
is - input stream
- Returns:
- supplied charset if not blank, or the detected charset
- Throws:
IOException - problem detecting charset
- Since:
- 3.0.0
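The decision order described above (supplied charset first, then detection, then UTF-8) can be sketched as follows. This is an illustration under stated assumptions: `DetectIfBlankSketch`, `Detector`, and `charsetOrDetect` are hypothetical names, and the detector is passed in as a parameter only so that the example stays self-contained.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Norconex Importer.
final class DetectIfBlankSketch {

    // Stand-in for whatever performs the actual detection.
    @FunctionalInterface
    interface Detector {
        String detect(InputStream is) throws IOException;
    }

    private DetectIfBlankSketch() {
    }

    static boolean isBlank(String value) {
        return value == null || value.trim().isEmpty();
    }

    // Returns the supplied charset when it is not blank. Otherwise
    // runs the detector and assumes UTF-8 if nothing is detected.
    static String charsetOrDetect(String charset, InputStream is,
            Detector detector) throws IOException {
        if (!isBlank(charset)) {
            return charset;
        }
        String detected = detector.detect(is);
        return isBlank(detected) ? "UTF-8" : detected;
    }
}
```

With this ordering, an explicitly configured charset always wins and the stream is never read in that case.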
-
firstNonBlankOrUTF8
public static String firstNonBlankOrUTF8(String... charsets)
Returns the first non-blank character encoding, or returns UTF-8 if they are all blank.
- Parameters:
charsets - character encodings to test
- Returns:
- first non-blank, or UTF-8
- Since:
- 3.0.0
-
firstNonBlankOrUTF8
public static String firstNonBlankOrUTF8(ParseState parseState, String... charsets)
Returns the first non-blank character encoding, or returns UTF-8 if they are all blank or in post-parse state. That is, UTF-8 is always returned if parsing has already occurred (since parsing converts content encoding to UTF-8).
- Parameters:
parseState - document parsing state
charsets - character encodings to test
- Returns:
- first non-blank, or UTF-8
- Since:
- 3.0.0
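The selection rule of both firstNonBlankOrUTF8 overloads can be sketched as shown below. This is a rough illustration, not the library source: `FirstNonBlankSketch` is a hypothetical class, and the nested `Stage` enum is a stand-in for the library's `ParseState` type, whose actual members are not listed in this page.

```java
// Hypothetical helper, not part of Norconex Importer.
final class FirstNonBlankSketch {

    // Stand-in for the library's ParseState (assumed values).
    enum Stage { PRE, POST }

    private FirstNonBlankSketch() {
    }

    // Returns the first candidate that is neither null nor blank,
    // or "UTF-8" if there is none.
    static String firstNonBlankOrUtf8(String... charsets) {
        if (charsets != null) {
            for (String charset : charsets) {
                if (charset != null && !charset.trim().isEmpty()) {
                    return charset;
                }
            }
        }
        return "UTF-8";
    }

    // After parsing, content is assumed to already be UTF-8, so the
    // candidates are ignored in the POST stage.
    static String firstNonBlankOrUtf8(Stage stage, String... charsets) {
        if (stage == Stage.POST) {
            return "UTF-8";
        }
        return firstNonBlankOrUtf8(charsets);
    }
}
```

A typical call would pass candidates in priority order, for example a configured charset followed by a detected one.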