Question

Not all byte sequences are valid UTF-8. A UTF-8 decoder should be prepared for:

1. the red invalid bytes in the above table
2. an unexpected continuation byte
3. a start byte not followed by enough continuation bytes
4. an Overlong Encoding as described above
5. A 4-byte sequence (starting with 0xF4) that decodes to a value greater than U+10FFFF
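
The five cases in the list can be checked mechanically. The following is a minimal sketch, not a production decoder: the function name `utf8_error` and its messages are made up for illustration, it assumes the never-valid bytes are 0xC0, 0xC1 and 0xF5 to 0xFF, and it checks only the five listed cases (surrogate code points, for example, are not rejected).

```python
def utf8_error(data: bytes):
    """Return a description of the first UTF-8 error in data, or None if none is found.

    Hypothetical helper for illustration; covers only the five listed cases.
    """
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                       # single-byte (ASCII) character
            i += 1
            continue
        if b <= 0xBF:                      # case 2: 10xxxxxx with no start byte before it
            return f"unexpected continuation byte 0x{b:02X} at offset {i}"
        if b in (0xC0, 0xC1) or b >= 0xF5:  # case 1: bytes that are never valid
            return f"invalid byte 0x{b:02X} at offset {i}"
        # Pick sequence length, payload bits of the start byte, and the
        # smallest code point that legitimately needs this many bytes.
        if b <= 0xDF:
            length, cp, minimum = 2, b & 0x1F, 0x80
        elif b <= 0xEF:
            length, cp, minimum = 3, b & 0x0F, 0x800
        else:
            length, cp, minimum = 4, b & 0x07, 0x10000
        for j in range(i + 1, i + length):
            # case 3: input ends early or the next byte is not 10xxxxxx
            if j >= n or data[j] & 0xC0 != 0x80:
                return f"start byte 0x{b:02X} at offset {i} lacks continuation bytes"
            cp = (cp << 6) | (data[j] & 0x3F)
        if cp < minimum:                   # case 4: overlong encoding
            return f"overlong encoding of U+{cp:04X} at offset {i}"
        if cp > 0x10FFFF:                  # case 5: beyond the Unicode range
            return f"value 0x{cp:X} at offset {i} exceeds U+10FFFF"
        i += length
    return None


print(utf8_error(b"\xc3\x80"))          # None
print(utf8_error(b"\xc0\x80"))          # invalid byte 0xC0 at offset 0
print(utf8_error(b"\xf4\x90\x80\x80"))  # value 0x110000 at offset 0 exceeds U+10FFFF
```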

According to the code page layout, 0xC0 and 0xC1 are invalid and must never appear in a valid UTF-8 sequence. Here is what I get for 0xC0 and 0xC1:

Byte 2   Byte 1      Num   Char
11000011 10000000    192   À
11000011 10000001    193   Á

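There are characters that correspond to these byte sequences, even though there should not be any. Am I doing something wrong?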

Answer 1

You are just mixing up the terms:

The code point U+00C0 is the character "À", and U+00C1 is "Á".
Encoded in UTF-8, they are the byte sequences C3 80 and C3 81, respectively.
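
To see the distinction concretely, here is a small Python sketch that encodes the two code points and prints the resulting bytes. The helper name `utf8_hex` is made up for this example; it relies only on the built-in `str.encode` and `bytes.hex` (the separator argument needs Python 3.8 or later).

```python
def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 encoding of one code point as spaced uppercase hex.

    Hypothetical helper for illustration.
    """
    return chr(code_point).encode("utf-8").hex(" ").upper()


for cp in (0x00C0, 0x00C1):
    encoded = chr(cp).encode("utf-8")
    # Show the same bytes in binary to compare with the table in the question.
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    print(f"U+{cp:04X} -> {utf8_hex(cp)} ({bits})")

# U+00C0 -> C3 80 (11000011 10000000)
# U+00C1 -> C3 81 (11000011 10000001)
```

The values 0xC0 and 0xC1 show up only as code point numbers here; the encoded bytes are C3 followed by 80 or 81.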

The bytes C0 and C1 should never appear in UTF-8 encoded data.
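
The reason is that a two-byte sequence starting with C0 or C1 could only carry a value from 0x00 to 0x7F, which must be encoded as a single byte, so any such sequence would be an overlong encoding. A quick way to check this, assuming Python's built-in strict UTF-8 decoder (the helper name `is_valid_utf8` is made up for this example):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8, False otherwise.

    Hypothetical helper for illustration.
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"\xc3\x80"))  # True: the encoding of code point U+00C0
print(is_valid_utf8(b"\xc0\x80"))  # False: would be an overlong form of U+0000
print(is_valid_utf8(b"\xc1\xbf"))  # False: would be an overlong form of U+007F
```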

Code points identify characters independently of any byte representation. Bytes are just bytes.
