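Question: Does the length change when converting between String and byte array?

Time: 2018-03-23 09:24:39

Tags: java

I assumed that the length always stays the same when converting between byte[] and String. But the example below shows that this is not the case.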

byte[] b1 = {55, -71, -35, -35, 83, -115, 107, -80, -62, 86, 98, 125, -68, -12, 14, -92, -122, -65, -117, -26, 80, -102, 75, 49, -120, -10, 18, -8, 82, -21, 49, 80, 125, 94, -35, -66, 91, 79, 77, -29, -48, -85, 29, -48, -118, -13, -84, -77, 93, -101, -7, 46, -44, -25, -42, 72, -33, -81, -120, -40, 40, 65, 58, -74, -34, 99, -8, -118, 83, 110, -94, 69, 21, -27, 114, 43, -23, 7, 120, -15, 21, 110, 108, 98, -99, 7, 107, 63, -48, 32, 123, 35, -36, -35, 7, -75, 40, -3, 33, 92, -79, 119, 22, -63, 27, 123, -98, 92, -93, 30, 51, 55, 106, -109, 99, 123, 25, -111, -53, 66, 117, 121, -20, 6, -10, -34, -76, -120, -56, 123, 48, -9, -116, -81, -47, 67, 80, 14, -58, -17, -92, -75, 119, 27, 125, -115, -31, 114, -96, 126, -87, 98, -108, -21, -113, 36, 104, -69, -74, 41, -68, 115, 103, 106, -39, 10, 0, 7, -66, 84, -94, 46, -1, -62, -115, 104, -104, 53, 86, -117, 15, -100, 46, 7, 57, -84, 40, 118, -12, 93, -6, -31, 28, 81, -72, 123, 54, -76, 123, 111, 54, 121, 126, -19, -32, 99, 109, -68, -103, 29, 75, 57, 115, 33, 110, -23, -116, 11, 112, 117, 67, -100, 21, 94, -16, 94, 24, 47, -90, -48, 30, 15, 24, 98, -114, -96, 37, -47, 32, 74, 110, 58, 35, 77, 62, -74, 94, 59, 63, -35, -59, 10, 43, 65, -63, 59, -65, 58, 69, 88, -91, -58, -103, 88, 6, -105, 92, -9, -19, 26, 5, -42, -38, -82, -56, 42, -45, 30, 103, -113, -64, -82, 29, 6, 40, 102, 44, 59, 51, -69, -70, 90, -126, 40, -105, 103, 92, 124, 120, 43, -53, 73, -109, 103, -62, -64, -68, -81, -61, -68, -73, -6, -112, 85, 119, -92, -85, -31, -37, 32, -2, 100, 34, 41, -128, 73, -92, -94, 71, 98, 0, 126, -98, -51, -8, -72, -97, 66, -71, -14, -74, -39, 56, 71, 46, -94, 40, 32, -84, -17, -128, 60, 25, 75, -104, 25, 49, -14, -103, -89, 97, -61, 89, -109, 118, 114, 123, -38, 101, 98, 7, 70, 9, 42, 98, -94, 73, -70, 72, 43, 52, -89, -20, -22, -58, -109, -88, 36, 118, 71, -34, -85, -24, -46, -120, -118, 5, -118, -53, -5, -87, -116, -38, 101, 74, -111, -2, 12, 48, -105, -110, 6, -114, 31, 70, -42, -118, -61, 82, 83, -37, 27, -56, 91, 113, -23, -40, 
-121, 35, 79, 3, 79, 58, -54, -11, -41, -48, -109, -54, 96, 80, 77, -69, -88, -75, -126, -64, 54, 33, 7, 121, 16, -49, 26, 68, 94, 107, -79, -17, -67, -59, 57, -8, -36, 99, 29, -2, 36, -91, 70, 56, 76, 88, 40, 85, -16, 120, -101, -21, 83, 103, -91, 28, 14, 17, 73, -102, -121, 69, -102, 18, -115, -92, -5, -50, -20};
System.out.println("length of b1 = " + b1.length);

// Decode the bytes as UTF-8 text
String s = new String(b1, "utf-8");
System.out.println("length of s = " + s.length());

// Encode the text back into UTF-8 bytes
byte[] b2 = s.getBytes("utf-8");
System.out.println("length of b2 = " + b2.length);

Running it, I get the output:

length of b1 = 496
length of s = 470
length of b2 = 877

Why are they so different?

1 answer:

Answer 0: (score: 1)

In the UTF-8 encoding, a single character may take more than 1 byte.

Example:

Character -> Codepoints -> UTF-8 Encoding
ä         -> 00E4       -> C3 A4

So 2 bytes in the input can show up as 1 character in the output.
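
The two-bytes-to-one-character case can be checked directly. This is a minimal sketch; the class name `Utf8LengthDemo` and the helper `decodedLength` are made up for illustration, and it uses `StandardCharsets.UTF_8` instead of the `"utf-8"` string so that no checked exception has to be handled.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    // Returns String.length() (UTF-16 code units) after decoding the bytes as UTF-8.
    static int decodedLength(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).length();
    }

    public static void main(String[] args) {
        // C3 A4 is the UTF-8 encoding of ä (U+00E4)
        byte[] bytes = {(byte) 0xC3, (byte) 0xA4};
        String s = new String(bytes, StandardCharsets.UTF_8);

        System.out.println("length of bytes = " + bytes.length);  // 2
        System.out.println("length of s = " + s.length());        // 1
        // Encoding the same string again gives the original two bytes back
        System.out.println("length after round trip = "
                + s.getBytes(StandardCharsets.UTF_8).length);     // 2
    }
}
```

For valid UTF-8 input like this, the round trip returns the original bytes, so only the character count differs from the byte count.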

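Unicode also allows some characters (especially accented letters in languages other than English) to be decomposed. To stay with my example, the character ä can be decomposed into

a followed by the combining diaeresis (U+0308)

These are now 2 characters with the following encoding

Character   -> Codepoints -> UTF-8 Encoding
a + U+0308  -> 0061 0308  -> 61 CC 88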
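
The decomposed form of ä can be inspected with `java.text.Normalizer`. This is a sketch only; the class name `DecompositionDemo` and the helper `decompose` are made up for illustration. In the canonical decomposition (NFD) the base letter comes first and is followed by the combining diaeresis U+0308. Note that the decomposition is requested explicitly here.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class DecompositionDemo {
    // Canonical decomposition (NFD) of the given string.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00E4";               // ä as a single code point
        String decomposed = decompose(composed);  // a + U+0308

        System.out.println("composed length = " + composed.length());     // 1
        System.out.println("decomposed length = " + decomposed.length()); // 2

        // Print the UTF-8 bytes of the decomposed form in hex
        for (byte b : decomposed.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b);        // 61 CC 88
        }
        System.out.println();
    }
}
```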

Particularly with Asian languages, this kind of decomposition can come up more often than in this example.

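So for this example (only if the decomposition is applied between decoding and encoding; that is not guaranteed, and Java's String constructor and getBytes do not normalize on their own), you would get the following program output: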

length of b1 = 2
length of s = 1
length of b2 = 3

I think this could explain your findings.
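
One more effect is worth checking against the question's data, since the array looks like arbitrary binary data rather than encoded text. The `String(byte[], Charset)` constructor is documented to replace malformed input with the charset's replacement string, which for UTF-8 is U+FFFD, and U+FFFD itself takes 3 bytes in UTF-8. The sketch below uses the first five bytes of `b1`; the class name `MalformedUtf8Demo` and the helper `roundTrip` are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class MalformedUtf8Demo {
    // Decodes the bytes as UTF-8 and encodes the resulting string again.
    static byte[] roundTrip(byte[] input) {
        String s = new String(input, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // First five bytes of b1 from the question; -71 and -35 do not
        // form valid UTF-8 sequences in this position.
        byte[] b1 = {55, -71, -35, -35, 83};
        String s = new String(b1, StandardCharsets.UTF_8);
        byte[] b2 = roundTrip(b1);

        System.out.println("length of b1 = " + b1.length);
        System.out.println("length of s = " + s.length());
        System.out.println("length of b2 = " + b2.length);
        // true if at least one byte was replaced during decoding
        System.out.println("contains U+FFFD = " + (s.indexOf('\uFFFD') >= 0));
    }
}
```

If the last line prints true, the original bytes cannot be recovered from the string. For binary data such as cipher text, a byte-preserving representation like `java.util.Base64` is the usual choice instead of `new String(...)`.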