Question

After an HTTP request, I have a byte array that is encoded in UTF-8, for example:

byte[] array = new byte[]{(byte) 0xc3, (byte) 0xa4, (byte) 0xc2, (byte) 0x96};

I decode the byte array with new String(array, "UTF-8").

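In this example, the first decoded char is 0xe4, which is the letter ä in Unicode. So far, so good. The second char is 0x96, which is the en dash – in Windows-1252, but in Unicode it is the control character START OF GUARDED AREA (SPA).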

Because Java interprets a char as Unicode, I end up with invisible characters.
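To make the symptom concrete, here is a minimal sketch that prints the code points produced by the plain UTF-8 decoding. The class name CodePointDump and the helper dump are made up for illustration and are not part of the original post.

```java
import java.nio.charset.StandardCharsets;

class CodePointDump {
    // Hypothetical helper: formats every UTF-16 char of s as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Values above 0x7f need an explicit cast because Java bytes are signed.
        byte[] array = {(byte) 0xc3, (byte) 0xa4, (byte) 0xc2, (byte) 0x96};
        String decoded = new String(array, StandardCharsets.UTF_8);
        // U+00E4 is ä; U+0096 is the invisible control character, not the en dash.
        System.out.println(dump(decoded)); // prints "U+00E4 U+0096"
    }
}
```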

My question: how do I decode the byte array correctly so that I get ä– (0xe4 0x2013 in Unicode)?

Thanks in advance for your help :).

Answer 1

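Your server appears to be confusing ISO-Latin-1 (ISO-8859-1) with the proprietary Windows-1252 code page, and the data you receive is the result. Windows-1252 differs from ISO-Latin-1 in several places, namely the byte range 0x80 to 0x9f, where ISO-Latin-1 has control characters and Windows-1252 has mostly printable characters such as the en dash.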
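The difference between the two encodings can be checked directly. The following is a small sketch, with the class name CharsetDiff and the helper decodeSingleByte invented for illustration, that decodes the same byte under both charsets. It assumes the JVM provides the windows-1252 charset, which common JDKs do, although only a handful of charsets are formally guaranteed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDiff {
    // Hypothetical helper: decodes one byte with the given charset and returns the resulting char value.
    static int decodeSingleByte(int b, Charset cs) {
        return new String(new byte[]{(byte) b}, cs).charAt(0);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252"); // assumed to be available
        // 0x96 falls in the range where the two encodings disagree.
        System.out.printf("0x96 as ISO-8859-1:   U+%04X%n", decodeSingleByte(0x96, StandardCharsets.ISO_8859_1)); // U+0096
        System.out.printf("0x96 as windows-1252: U+%04X%n", decodeSingleByte(0x96, cp1252));                      // U+2013
        // 0xe4 lies outside that range, so both agree on ä.
        System.out.printf("0xe4 as ISO-8859-1:   U+%04X%n", decodeSingleByte(0xe4, StandardCharsets.ISO_8859_1)); // U+00E4
        System.out.printf("0xe4 as windows-1252: U+%04X%n", decodeSingleByte(0xe4, cp1252));                      // U+00E4
    }
}
```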

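You can repair the data by converting the decoded string back into the bytes the server saw when it wrongly assumed Latin-1, and then interpreting those bytes as CP1252, like this: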

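// Requires: import java.nio.charset.Charset; import java.nio.charset.StandardCharsets;
String string = new String(array, StandardCharsets.UTF_8);   // "\u00e4\u0096"
byte[] fix = string.getBytes(StandardCharsets.ISO_8859_1);   // {0xe4, 0x96}
string = new String(fix, Charset.forName("windows-1252"));   // "\u00e4\u2013", i.e. ä–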
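Put together as a self-contained program, the repair might look like the sketch below. The class name Cp1252Repair and the method repair are illustrative names, and the code assumes the windows-1252 charset is available in the running JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252Repair {
    private static final Charset CP1252 = Charset.forName("windows-1252"); // assumed to be available

    // Hypothetical helper: undoes a "CP1252 bytes treated as Latin-1, then sent as UTF-8" mix-up.
    static String repair(byte[] utf8Bytes) {
        // Step 1: decode what was actually sent over the wire.
        String wrong = new String(utf8Bytes, StandardCharsets.UTF_8);
        // Step 2: map each char back to the single byte the server originally held.
        byte[] original = wrong.getBytes(StandardCharsets.ISO_8859_1);
        // Step 3: decode those bytes with the encoding the data was really in.
        return new String(original, CP1252);
    }

    public static void main(String[] args) {
        byte[] array = {(byte) 0xc3, (byte) 0xa4, (byte) 0xc2, (byte) 0x96};
        String fixed = repair(array);
        for (int i = 0; i < fixed.length(); i++) {
            System.out.printf("U+%04X%n", (int) fixed.charAt(i)); // prints U+00E4, then U+2013
        }
    }
}
```

Note that this only round-trips cleanly when every decoded char is at most U+00FF. String.getBytes replaces anything outside Latin-1 with '?', so the repair should be applied only to data known to have gone through this specific mix-up.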