Question

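I'm trying to understand character encoding in Java. Characters in Java are stored in 16 bits using UTF-16 encoding. So when I convert a string of 6 characters to bytes, I get 6 bytes, as shown below, where I expected 12. Is there a concept I'm missing?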

package learn.java;

public class CharacterTest {

    public static void main(String[] args) {
        String str = "Hadoop";
        byte bt[] = str.getBytes();
        System.out.println("the length of character array is " + bt.length);
    } 
}

o/p: the length of character array is 6

As @Darshan suggested, I tried getting the bytes with the UTF-16 encoding, but the result is still not what I expected.

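package learn.java;

import java.io.UnsupportedEncodingException;

public class CharacterTest {

    public static void main(String[] args) {
        String str = "Hadoop";
        try {
            byte bt[] = str.getBytes("UTF-16");
            System.out.println("the length of character array is " + bt.length);
        } catch (UnsupportedEncodingException e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }
    }
}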

o/p: the length of character array is 14

Answer 1

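In the UTF-16 version you get 14 bytes because a byte order mark (BOM) is inserted to distinguish big-endian (the default) from little-endian. If you specify UTF-16LE, you get 12 bytes (little-endian, with no byte order mark added).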

See http://www.unicode.org/faq/utf_bom.html#gen7
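As a quick check of the 14-versus-12 figures for the asker's string, here is a minimal sketch. The class name `BomLengths` and the helper `encodedLength` are made up for illustration; the `StandardCharsets` constants require Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomLengths {

    // Hypothetical helper: number of bytes produced when encoding s with cs.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String str = "Hadoop";

        // "UTF-16" encodes big-endian and prepends a 2-byte BOM (FE FF).
        System.out.println("UTF-16   : " + encodedLength(str, StandardCharsets.UTF_16));
        // The LE/BE variants fix the byte order up front, so no BOM is written.
        System.out.println("UTF-16LE : " + encodedLength(str, StandardCharsets.UTF_16LE));
        System.out.println("UTF-16BE : " + encodedLength(str, StandardCharsets.UTF_16BE));

        // Show the two leading BOM bytes of the plain "UTF-16" encoding.
        byte[] withBom = str.getBytes(StandardCharsets.UTF_16);
        System.out.printf("first two bytes: %02X %02X%n", withBom[0] & 0xFF, withBom[1] & 0xFF);
    }
}
```

Passing a `Charset` constant rather than a charset name also avoids the checked `UnsupportedEncodingException` that `getBytes(String)` declares.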

EDIT - use this program to see the actual bytes produced by different encodings:

public class Test {
    public static void main(String args[]) throws Exception {
        // bytes in the first argument, encoded using second argument
        byte[] bs = args[0].getBytes(args[1]);
        System.err.println(bs.length + " bytes:");

        // print hex values of bytes and (if printable), the char itself
        char[] hex = "0123456789ABCDEF".toCharArray();
        for (int i=0; i<bs.length; i++) {
            int b = (bs[i] < 0) ? bs[i] + 256 : bs[i];
            System.err.print(hex[b>>4] + "" + hex[b&0xf] 
                + ( ! Character.isISOControl((char)b) ? ""+(char)b : ".")
                + ( (i%4 == 3) ? "\n" : " "));
        }
        System.err.println();   
    }
}

For example, when run under UTF-8 (under other JVM default encodings, the characters for FE and FF would show up differently), the output is:

$ javac Test.java  && java -cp . Test hello UTF-16
12 bytes:
FEþ FFÿ 00. 68h
00. 65e 00. 6Cl
00. 6Cl 00. 6Fo

and

$ javac Test.java  && java -cp . Test hello UTF-16LE
10 bytes:
68h 00. 65e 00.
6Cl 00. 6Cl 00.
6Fo 00.

and

$ javac Test.java  && java -cp . Test hello UTF-16BE
10 bytes:
00. 68h 00. 65e
00. 6Cl 00. 6Cl
00. 6Fo

Answer 2

According to the String.getBytes() method's documentation, the string is encoded into a sequence of bytes using the platform's default charset.

I assume your platform's default charset is ISO-8859-1 (or a similar one-byte-per-character charset). These charsets encode one character as one byte.

If you want to specify the encoding, use the method String.getBytes(Charset) or String.getBytes(String).
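A minimal sketch of the difference between the default and an explicit charset follows. The class name `ExplicitCharset` and the helper `byteCount` are made up for illustration. Note that UTF-8 also encodes the ASCII letters of "Hadoop" in one byte each, so a result of 6 bytes does not by itself tell you whether the platform default is ISO-8859-1 or UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {

    // Hypothetical helper: byte count of s when encoded with an explicit charset.
    static int byteCount(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String str = "Hadoop";

        // The no-argument getBytes() depends on the platform default charset,
        // so this line can print different values on different machines.
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + str.getBytes().length);

        // With an explicit charset the result is the same on every platform.
        System.out.println("ISO-8859-1: " + byteCount(str, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8     : " + byteCount(str, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE  : " + byteCount(str, StandardCharsets.UTF_16BE));
    }
}
```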

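Regarding the 16-bit storage: this is how Java stores characters internally, and therefore strings as well. It is based on the original Unicode specification.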

Answer 3

String.getBytes() uses the default platform encoding. Try this

byte bt[] = str.getBytes("UTF-16");

Answer 4

I think this will help: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

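This also helps: "UTF-16 (16-bit Unicode Transformation Format) is a character encoding [...] The encoding is variable-length, as code points are encoded with one or two 16-bit code units." (from Wikipedia)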
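To make the "one or two 16-bit code units" point concrete, here is a small sketch comparing a letter from the Basic Multilingual Plane with a supplementary character (U+1F600, written as a surrogate pair escape). The class name `CodeUnits` and its helpers are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class CodeUnits {

    // Number of 16-bit code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Bytes in UTF-16BE, which writes no byte order mark.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String bmp = "A";                      // U+0041, one code unit
        String supplementary = "\uD83D\uDE00"; // U+1F600, a surrogate pair

        for (String s : new String[] { bmp, supplementary }) {
            System.out.println("code units=" + codeUnits(s)
                    + " code points=" + codePoints(s)
                    + " UTF-16BE bytes=" + utf16Bytes(s));
        }
    }
}
```

So "two bytes per character" holds for strings such as "Hadoop", but String.length() counts code units, not code points, once supplementary characters are involved.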

Answer 5

For the UTF-16 encoding, use str.getBytes("UTF-16");

But it gives a length of 14 for the byte[]; see http://rosettacode.org/wiki/String_length for more details.
