Why is there no UTF-24?

Asked: 2012-04-13 15:32:29

Tags: unicode character-encoding utf-32

  

Possible duplicate:
  Why UTF-32 exists whereas only 21 bits are necessary to encode every character?

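The maximum Unicode code point is 0x10FFFF, so a UTF-32 code unit carries 21 information bits and 11 superfluous padding bits. So why is there no UTF-24 encoding (that is, UTF-32 with the high byte removed) that stores each code point in 3 bytes instead of 4?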

1 answer:

Answer 0 (score: 21)

Well, the truth is that UTF-24 was proposed in 2007:

http://unicode.org/mail-arch/unicode-ml/y2007-m01/0057.html

The pros and cons mentioned there were:

"UTF-24 
Advantages: 
 1. Fixed length code units. 
 2. Encoding format is easily detectable for any content, even if mislabeled. 
 3. Byte order can be reliably detected without the use of BOM, even for single-code-unit data. 
 4. If octets are dropped / inserted, decoder can resync at next valid code unit. 
 5. Practical for both internal processing and storage / interchange. 
 6. Conversion to code point scalar values is more trivial then for UTF-16 surrogate pairs 
    and UTF-7/8 multibyte sequences. 
 7. 7-bit transparent version can be easily derived. 
 8. Most compact for texts in archaic scripts. 
Disadvantages: 
 1. Takes more space then UTF-8/16, except for texts in archaic scripts. 
 2. Comparing to UTF-32, extra bitwise operations required to convert to code point scalar values. 
 3. Incompatible with many legacy text-processing tools and protocols. "
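
To make the "extra bitwise operations" point concrete, here is a minimal sketch of what such a codec might look like. UTF-24 was never standardized, so the function names `utf24_encode` and `utf24_decode` and the big-endian byte order are my own assumptions, not anything taken from the proposal:

```python
# Hypothetical UTF-24 (big-endian) codec. UTF-24 is not a real standard;
# the names and the byte order here are illustrative assumptions.

def utf24_encode(text: str) -> bytes:
    """Pack each code point into exactly 3 big-endian bytes."""
    out = bytearray()
    for ch in text:
        out += ord(ch).to_bytes(3, "big")
    return bytes(out)


def utf24_decode(data: bytes) -> str:
    """Rebuild code points from 3-byte groups, rejecting invalid values."""
    if len(data) % 3 != 0:
        raise ValueError("input length is not a multiple of 3")
    chars = []
    for i in range(0, len(data), 3):
        # The byte assembly below is the work UTF-32 avoids, since a
        # 32-bit code unit can be loaded as one native integer.
        cp = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2]
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point 0x{cp:X} at offset {i}")
        chars.append(chr(cp))
    return "".join(chars)
```

Every code point costs three bytes here, whether it is ASCII or an Old Italic letter, which is where both the "fixed length" advantage and the "takes more space" disadvantage come from.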

As David Starner pointed out in http://www.mail-archive.com/unicode@unicode.org/msg16011.html:

  

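Why? UTF-24 will almost invariably be larger than UTF-16, unless you are talking about a document in Old Italic or Gothic. The mathematical alphanumeric characters will almost always be combined with enough ASCII to make UTF-8 a win, and if not, with enough BMP characters to make UTF-16 a win. Modern computers don't deal well with 24-bit chunks; in memory they would take up 32 bits apiece unless you declared them packed, and then they would be a lot slower than UTF-16 or UTF-32. And if you are storing to disk, you may as well use BOCU or SCSU (you are already going non-standard), or use standard compression with UTF-8, UTF-16, BOCU or SCSU. Compressed SCSU or BOCU should take up half the space of UTF-24, if that.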
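
The size argument is easy to check. The sketch below is my own illustration (the helper name `encoded_sizes` is made up); it compares byte counts without a BOM and assumes a hypothetical UTF-24 would cost exactly 3 bytes per code point:

```python
# Rough size comparison, no BOM. The "utf24" figure is an assumption:
# a hypothetical fixed-width encoding at 3 bytes per code point.

def encoded_sizes(text: str) -> dict:
    """Return the encoded size in bytes of `text` for each encoding form."""
    return {
        "utf8": len(text.encode("utf-8")),
        "utf16": len(text.encode("utf-16-le")),
        "utf24": 3 * len(text),  # len() counts code points in Python 3
        "utf32": len(text.encode("utf-32-le")),
    }


ascii_text = "hello"                  # 5 ASCII characters
bmp_text = "\u65e5\u672c\u8a9e"       # 3 BMP (CJK) characters
gothic_text = "\U00010330\U00010331"  # 2 Gothic letters, outside the BMP

for label, sample in [("ASCII", ascii_text), ("BMP", bmp_text), ("Gothic", gothic_text)]:
    print(label, encoded_sizes(sample))
```

For the ASCII sample UTF-8 is smallest (5 bytes against 15), for the BMP sample UTF-16 is smallest (6 against 9), and only for the Gothic sample does the 3-byte form come out ahead (6 against 8 for each of the others).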

You may also want to look at the following StackOverflow post:

Why UTF-32 exists whereas only 21 bits are necessary to encode every character?