Question

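I use the following code to write a string to a stream as UTF-8. I prefix the string's bytes with a signed short holding their length, and then I write them out. There is one exception: I cannot use 0x0010 as the prefix, because it is a keyword in the final format. However, I must make sure the reader ends up with exactly the same string as the str argument, even when its length is 0x0010.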

public static void writeString(DataOutputStream out,String str) throws IOException{
    byte[] bytes = str.getBytes(CHARSET_UTF_8);
    if(bytes.length > Short.MAX_VALUE){
        throw new IOException();
    }
    short len = (short)bytes.length;
    if(bytes.length == 0x0010){
        len++;
    }
    out.writeShort(len);
    out.write(bytes);
    if(bytes.length == 0x0010){
        out.write(DEAD_BYTE);
    }
}
public static final Charset CHARSET_UTF_8 = Charset.forName("UTF-8");

Is there any byte (out of the 256 possible values) that UTF-8 will not recognize at the end of a string?

Also, the following question did not help me; I ended up with a ? character: 30025693

Answer 1

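By default, whatever bytes you decode as UTF-8 will be turned into some character. If a byte is not part of a valid UTF-8 sequence, the replacement character (�, U+FFFD) is substituted for it, so it still shows up in your output.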
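The behaviour described above can be checked directly. The following is a minimal sketch (the class name ReplacementDemo is made up for illustration): it decodes a byte sequence ending in 0xFF, a byte that never occurs in well-formed UTF-8, first leniently with the String constructor and then strictly with a CharsetDecoder configured to report errors.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original code.
final class ReplacementDemo {

    // Lenient decoding: malformed input is replaced with U+FFFD.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict decoding: malformed input raises a CharacterCodingException.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        byte[] padded = {'a', 'b', (byte) 0xFF};
        String lenient = decodeLenient(padded);
        System.out.println(lenient.length());               // 3
        System.out.println(lenient.charAt(2) == '\uFFFD');  // true
        try {
            decodeStrict(padded);
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```

The strict variant is useful when silently accepting a stray padding byte would hide a framing bug.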

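You could remove the � from the decoded string, but that character may also have come from the input string itself. Instead, strip the extra byte from the encoded UTF-8 bytes before decoding: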

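static String readString(final DataInputStream in) throws IOException {
    int len = in.readUnsignedShort();
    final byte[] bytes = new byte[len];
    // readFully blocks until all len bytes have arrived; read() may return fewer.
    in.readFully(bytes);
    // Assumes DEAD_BYTE is 0xFF (-1 as a Java byte), which never occurs in valid UTF-8.
    if (len > 0 && bytes[len - 1] == -1) {
        len--;
    }
    return new String(bytes, 0, len, UTF_8);
}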
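To see the padding approach work end to end, here is a small round-trip sketch. It is only an illustration: the class name PaddingRoundTrip and the encode/decode helpers are hypothetical, and DEAD_BYTE = 0xFF is an assumption, since the question does not show its value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical harness; DEAD_BYTE = 0xFF is an assumption.
final class PaddingRoundTrip {
    static final int RESERVED = 0x0010;
    static final int DEAD_BYTE = 0xFF;

    static void writeString(DataOutputStream out, String str) throws IOException {
        byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
        if (bytes.length > Short.MAX_VALUE) {
            throw new IOException("string too long");
        }
        boolean pad = bytes.length == RESERVED;
        out.writeShort(pad ? bytes.length + 1 : bytes.length);
        out.write(bytes);
        if (pad) {
            out.write(DEAD_BYTE); // one extra byte so the prefix becomes 0x0011
        }
    }

    static String readString(DataInputStream in) throws IOException {
        int len = in.readUnsignedShort();
        byte[] bytes = new byte[len];
        in.readFully(bytes);
        if (len > 0 && bytes[len - 1] == (byte) DEAD_BYTE) {
            len--; // drop the padding before decoding
        }
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }

    static byte[] encode(String str) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeString(new DataOutputStream(buf), str);
        return buf.toByteArray();
    }

    static String decode(byte[] data) throws IOException {
        return readString(new DataInputStream(new ByteArrayInputStream(data)));
    }

    public static void main(String[] args) throws IOException {
        String sixteen = "0123456789abcdef"; // exactly 0x0010 bytes in UTF-8
        byte[] wire = encode(sixteen);
        System.out.println(wire.length);                  // 2 + 16 + 1 = 19
        System.out.println(wire[1]);                      // 17, i.e. 0x0011
        System.out.println(decode(wire).equals(sixteen)); // true
    }
}
```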

Another option is to skip 0x0010 when encoding the length and shift every value from 0x0010 upward by 1:

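static void writeString(final DataOutputStream out, final String str) throws IOException {
    final byte[] bytes = str.getBytes(UTF_8);
    // After the shift, the largest encodable length is 0xFFFE (stored as 0xFFFF).
    if (bytes.length > 0xFFFE) {
        throw new IOException("string too long: " + bytes.length + " bytes");
    }
    int len = bytes.length;
    if (len >= 0x0010) {
        len++; // skip the reserved value 0x0010
    }
    out.writeShort(len); // writes the low 16 bits; the reader uses readUnsignedShort()
    out.write(bytes);
}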

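static String readString(final DataInputStream in) throws IOException {
    int len = in.readUnsignedShort();
    if (len == 0x0010) {
        // The writer never emits the reserved value, so the stream is corrupt.
        throw new IllegalStateException();
    } else if (len > 0x0010) {
        len--; // undo the shift applied by the writer
    }
    final byte[] bytes = new byte[len];
    // readFully blocks until all len bytes have arrived; read() may return fewer.
    in.readFully(bytes);
    return new String(bytes, UTF_8);
}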
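The length mapping behind this variant can be isolated and checked over the whole 16-bit range. The sketch below is only illustrative; ShiftedLength, encode and decode are hypothetical names, and MAX_LENGTH = 0xFFFE follows from assuming an unsigned 16-bit prefix.

```java
// Hypothetical helper class illustrating the length mapping only.
final class ShiftedLength {
    static final int RESERVED = 0x0010;
    static final int MAX_LENGTH = 0xFFFE; // stored as 0xFFFF after the shift

    // Maps a real length to the value written on the wire.
    static int encode(int length) {
        if (length < 0 || length > MAX_LENGTH) {
            throw new IllegalArgumentException("length out of range: " + length);
        }
        return length >= RESERVED ? length + 1 : length;
    }

    // Maps a wire value back to the real length.
    static int decode(int wireValue) {
        if (wireValue == RESERVED) {
            throw new IllegalStateException("reserved value 0x0010 on the wire");
        }
        return wireValue > RESERVED ? wireValue - 1 : wireValue;
    }

    public static void main(String[] args) {
        // Exhaustive check: no length maps to 0x0010 and every length round-trips.
        for (int n = 0; n <= MAX_LENGTH; n++) {
            int wire = encode(n);
            if (wire == RESERVED || wire > 0xFFFF || decode(wire) != n) {
                throw new AssertionError("mapping broken at " + n);
            }
        }
        System.out.println(encode(15)); // 15
        System.out.println(encode(16)); // 17
        System.out.println(decode(17)); // 16
    }
}
```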

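Both of these solutions are hacks and may cause trouble later. The proper fix is to remove this artificial restriction:

If you control the final format, redesign it so that any byte sequence is allowed.
Otherwise, if 0x0010 is forbidden only in the first position, always put a constant value there, followed by the actual length. (For example: 00 11 00 10 ...)
Otherwise, if 0x0010 must not appear at any position, escape it: encode \x00\x10 as \\n and \ as \\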
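The escaping option can be sketched at the byte level as follows. This is a simplified variant, not the exact scheme above: it escapes every 0x10 byte rather than only the pair 00 10, which is stricter than necessary but guarantees the pair can never occur. The class name ByteEscaper and the choice of escape byte and codes are assumptions.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical byte-stuffing helper; the escape byte and codes are assumptions.
final class ByteEscaper {
    static final byte ESC = 0x5C;       // '\'
    static final byte FORBIDDEN = 0x10; // second byte of the reserved pair 00 10
    static final byte ESC_CODE = 0x6E;  // 'n', stands in for an escaped 0x10

    static byte[] escape(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length + 8);
        for (byte b : raw) {
            if (b == ESC) {
                out.write(ESC);
                out.write(ESC);
            } else if (b == FORBIDDEN) {
                out.write(ESC);
                out.write(ESC_CODE);
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    static byte[] unescape(byte[] escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(escaped.length);
        for (int i = 0; i < escaped.length; i++) {
            byte b = escaped[i];
            if (b != ESC) {
                out.write(b);
                continue;
            }
            if (++i >= escaped.length) {
                throw new IllegalArgumentException("dangling escape byte");
            }
            byte code = escaped[i];
            if (code == ESC) {
                out.write(ESC);
            } else if (code == ESC_CODE) {
                out.write(FORBIDDEN);
            } else {
                throw new IllegalArgumentException("unknown escape code " + code);
            }
        }
        return out.toByteArray();
    }
}
```

Note that a length prefix would have to pass through the same escaping (or describe the escaped size), otherwise the prefix itself could still contain 00 10.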

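Finally, 0x0010 looks as though it was meant to be a UTF-16-encoded newline (strictly speaking, a line feed is U+000A; U+0010 is the DLE control character). If the final format really is text, you should not put binary data into it, as that will cause further problems. In that case, either put the string directly into the UTF-16-encoded text or use an ASCII-safe encoding such as base64.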
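If the surrounding format is text, the base64 route could look roughly like this, using the standard java.util.Base64 API. The class and method names (Base64Text, toAsciiSafe, fromAsciiSafe) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper class; only java.util.Base64 is a real API here.
final class Base64Text {

    // Encodes the string's UTF-8 bytes using only A-Z, a-z, 0-9, '+', '/' and '='.
    static String toAsciiSafe(String str) {
        return Base64.getEncoder().encodeToString(str.getBytes(StandardCharsets.UTF_8));
    }

    // Reverses toAsciiSafe.
    static String fromAsciiSafe(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(toAsciiSafe("hello"));                // aGVsbG8=
        System.out.println(fromAsciiSafe(toAsciiSafe("hello"))); // hello
    }
}
```

The encoded form contains no control characters, so neither a reserved value nor a length keyword can collide with the payload.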

Is there any byte that UTF-8 does not recognize?
