Class CharsetUtil


  • public final class CharsetUtil
    extends Object
    Character set utility methods.
    Since:
    2.5.0
    Author:
    Pascal Essiembre
    • Method Detail

      • convertCharset

        public static String convertCharset(String input,
                                            String inputCharset,
                                            String outputCharset)
                                     throws IOException
        Converts the character encoding of the supplied input value.
        Parameters:
        input - input value to apply conversion
        inputCharset - character set of the input value
        outputCharset - desired character set of the output value
        Returns:
        the converted value
        Throws:
        IOException - problem converting character set
      • convertCharset

        public static void convertCharset(InputStream input,
                                          String inputCharset,
                                          OutputStream output,
                                          String outputCharset)
                                   throws IOException
        Converts the character encoding of the supplied input.
        Parameters:
        input - input stream to apply conversion
        inputCharset - character set of the input stream
        output - output stream to which the converted content is written
        outputCharset - desired character set of the output stream
        Throws:
        IOException - problem converting character set
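        As a rough illustration of what a stream-to-stream charset conversion involves, the sketch below decodes bytes with one charset and re-encodes them with another using only JDK classes. The class and method names (CharsetConversionSketch, transcode) are hypothetical, and this is not necessarily how convertCharset is implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class CharsetConversionSketch {

    // Hypothetical helper, not the library method: decodes the input bytes
    // using inputCharset and re-encodes the characters using outputCharset.
    static void transcode(InputStream input, String inputCharset,
            OutputStream output, String outputCharset) throws IOException {
        Reader reader = new InputStreamReader(input, inputCharset);
        Writer writer = new OutputStreamWriter(output, outputCharset);
        char[] buffer = new char[4096];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, read);
        }
        // Flush so buffered characters reach the output.
        // Closing both streams is left to the caller.
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // "caf\u00e9" is 4 bytes in ISO-8859-1.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(latin1), "ISO-8859-1", out, "UTF-8");
        // The accented character needs two bytes in UTF-8, so the size grows.
        System.out.println(latin1.length + " -> " + out.size());
    }
}
```

        An unsupported charset name surfaces as an UnsupportedEncodingException, which is a subclass of IOException.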
      • detectCharset

        public static String detectCharset(String input)
                                    throws IOException
        Detects the character encoding of a string.
        Parameters:
        input - the input to detect encoding on
        Returns:
        the official name of the character encoding, or null if the input is null or blank
        Throws:
        IOException - if there is a problem finding the character encoding
      • detectCharset

        public static String detectCharset(String input,
                                           String declaredEncoding)
        Detects the character encoding of a string. If the string has a declared character encoding, specifying it will influence the detection result.
        Parameters:
        input - the input to detect encoding on
        declaredEncoding - declared input encoding, if known
        Returns:
        the official name of the character encoding, or null if the input is null or blank
      • detectCharset

        public static String detectCharset(InputStream input)
                                    throws IOException
        Detects the character encoding of an input stream. InputStream.markSupported() must return true; otherwise, no decoding will be attempted.
        Parameters:
        input - the input to detect encoding on
        Returns:
        the official name of the character encoding, or null if the input is null
        Throws:
        IOException - if there is a problem finding the character encoding
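        Because the stream overloads depend on InputStream.markSupported(), callers holding a stream without mark support may want to wrap it first. The sketch below shows one way to do that with the JDK's BufferedInputStream. The names MarkSupportSketch, ensureMarkSupported and unmarkable are hypothetical and are not part of this class.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class MarkSupportSketch {

    // Hypothetical helper: returns the stream itself when it already
    // supports mark/reset, otherwise a buffered wrapper that does.
    static InputStream ensureMarkSupported(InputStream input) {
        if (input == null || input.markSupported()) {
            return input;
        }
        return new BufferedInputStream(input);
    }

    // Demo-only stream whose markSupported() is false, which is the
    // default inherited from InputStream.
    static InputStream unmarkable(byte[] data) {
        final ByteArrayInputStream delegate = new ByteArrayInputStream(data);
        return new InputStream() {
            @Override
            public int read() {
                return delegate.read();
            }
        };
    }

    public static void main(String[] args) throws IOException {
        InputStream raw = unmarkable(new byte[] {1, 2, 3});
        InputStream safe = ensureMarkSupported(raw);
        System.out.println(raw.markSupported() + " " + safe.markSupported());

        // With mark support, bytes can be peeked at and then rewound.
        safe.mark(16);
        int first = safe.read();
        safe.reset();
        System.out.println(first == safe.read());
    }
}
```

        The wrapped stream, not the original, is the one that should then be passed on and read from, since the wrapper buffers bytes it has already consumed.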
      • detectCharset

        public static String detectCharset(InputStream input,
                                           String declaredEncoding)
                                    throws IOException
        Detects the character encoding of an input stream. If the input has a declared character encoding, specifying it will influence the detection result. InputStream.markSupported() must return true; otherwise, no decoding will be attempted.
        Parameters:
        input - the input to detect encoding on
        declaredEncoding - declared input encoding, if known
        Returns:
        the official name of the character encoding, or null if the input is null
        Throws:
        IOException - if there is a problem finding the character encoding
      • detectsCharset

        public static String detectsCharset(Doc doc)
                                     throws IOException
        Detects a document's character encoding. It first checks whether the encoding is defined by the document's DocInfo.getContentEncoding(). If not, it attempts to detect it from the document input stream. This method will NOT set the detected encoding on the DocInfo. If unable to detect, UTF-8 is assumed.
        Parameters:
        doc - document to detect encoding on
        Returns:
        string representation of character encoding
        Throws:
        IOException - problem detecting charset
        Since:
        3.0.0
      • detectCharsetIfBlank

        public static String detectCharsetIfBlank(String charset,
                                                  Doc doc)
                                           throws IOException
        Detects a document's character encoding if the supplied charset is blank. When blank, it first checks whether the encoding is defined by the document's DocInfo.getContentEncoding(). If not, it attempts to detect it from the document input stream. This method will NOT set the detected encoding on the DocInfo. If unable to detect, UTF-8 is assumed.
        Parameters:
        charset - character encoding to use if not blank
        doc - document to detect encoding on
        Returns:
        supplied charset if not blank, or the detected charset
        Throws:
        IOException - problem detecting charset
        Since:
        3.0.0
      • detectCharsetIfBlank

        public static String detectCharsetIfBlank(String charset,
                                                  InputStream is)
                                           throws IOException
        Detects a document character encoding if the supplied charset is blank. When blank, it will attempt to detect it from the input stream. If unable to detect, UTF-8 is assumed.
        Parameters:
        charset - character encoding to use if not blank
        is - input stream
        Returns:
        supplied charset if not blank, or the detected charset
        Throws:
        IOException - problem detecting charset
        Since:
        3.0.0
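        The fallback order described for detectCharsetIfBlank (supplied charset, then detection, then UTF-8) can be sketched as follows. This is only an illustration: DetectIfBlankSketch and charsetOrDetected are hypothetical names, the Supplier stands in for whatever detection is actually run, and "blank" is assumed to mean null, empty, or whitespace only.

```java
import java.util.function.Supplier;

final class DetectIfBlankSketch {

    // Hypothetical helper, not the library method. "detector" stands in
    // for the detection performed on the document or input stream.
    static String charsetOrDetected(String charset, Supplier<String> detector) {
        if (!isBlank(charset)) {
            return charset; // supplied value wins; detection is skipped
        }
        String detected = detector.get();
        // Fall back to UTF-8 when detection yields nothing usable.
        return isBlank(detected) ? "UTF-8" : detected;
    }

    // Assumption: "blank" means null, empty, or whitespace only.
    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(charsetOrDetected("ISO-8859-1", () -> "UTF-16"));
        System.out.println(charsetOrDetected("  ", () -> "UTF-16"));
        System.out.println(charsetOrDetected(null, () -> null));
    }
}
```

        Passing the detection step as a Supplier keeps it lazy in this sketch, so the potentially costly detection only runs when no charset was supplied.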
      • firstNonBlankOrUTF8

        public static String firstNonBlankOrUTF8(String... charsets)
        Returns the first non-blank character encoding, or returns UTF-8 if they are all blank.
        Parameters:
        charsets - character encodings to test
        Returns:
        first non-blank, or UTF-8
        Since:
        3.0.0
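        A minimal sketch of the documented selection rule, written against the JDK only. FirstNonBlankSketch and its method are hypothetical stand-ins, and treating null, empty, and whitespace-only values as blank is an assumption.

```java
final class FirstNonBlankSketch {

    // Hypothetical stand-in for the documented behaviour: the first
    // candidate that is not blank wins, otherwise "UTF-8" is returned.
    static String firstNonBlankOrUtf8(String... charsets) {
        if (charsets != null) {
            for (String charset : charsets) {
                // Assumption: blank = null, empty, or whitespace only.
                if (charset != null && !charset.trim().isEmpty()) {
                    return charset;
                }
            }
        }
        return "UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(firstNonBlankOrUtf8(null, " ", "ISO-8859-1", "UTF-16"));
        System.out.println(firstNonBlankOrUtf8("", null));
    }
}
```

        Candidates are typically listed from most to least trusted, for example an explicitly configured charset followed by a detected one.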
      • firstNonBlankOrUTF8

        public static String firstNonBlankOrUTF8(ParseState parseState,
                                                 String... charsets)
        Returns the first non-blank character encoding, or UTF-8 if they are all blank or the document is in a post-parse state. That is, UTF-8 is always returned if parsing has already occurred (since parsing converts content encoding to UTF-8).
        Parameters:
        parseState - document parsing state
        charsets - character encodings to test
        Returns:
        first non-blank, or UTF-8
        Since:
        3.0.0
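        The effect of the parse state can be illustrated with the sketch below. The Stage enum is a hypothetical stand-in for the library's ParseState type (whose constants are not listed here), and the blank check is the same assumption of null, empty, or whitespace only.

```java
final class ParseStateCharsetSketch {

    // Hypothetical stand-in for the library's ParseState type.
    enum Stage { PRE, POST }

    static String firstNonBlankOrUtf8(Stage stage, String... charsets) {
        // After parsing, content has been converted to UTF-8,
        // so the candidate charsets are ignored.
        if (stage == Stage.POST || charsets == null) {
            return "UTF-8";
        }
        for (String charset : charsets) {
            if (charset != null && !charset.trim().isEmpty()) {
                return charset;
            }
        }
        return "UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(firstNonBlankOrUtf8(Stage.PRE, "", "windows-1252"));
        System.out.println(firstNonBlankOrUtf8(Stage.POST, "windows-1252"));
    }
}
```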