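When reading about Unicode, I have heard many times that UTF-32 is a fixed-width encoding.
Taking a fixed-width encoding to mean "one that maps source symbols to codes of a set number of bits," and assuming the source symbols in question are Unicode code points, this all makes sense. However, if you take the source symbols to be the graphemes of the underlying language, things get a lot more complicated.
So my question is: in the literal sense, is UTF-32 really a fixed-width encoding? If not, is there any possible encoding that is fixed-width in that sense?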
Answer 0 (score: 5)
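One of the comments references the article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky, which was written in 2003. At the time, it served as a wake-up call (and it probably still does in some quarters). However, it is not without (minor, but important) technical problems, though the overall thesis ('you need to know about Unicode, and you need to know which encoding a string is in') remains valid. The comment then goes on:
Yes, UTF-16 and UTF-32 are both fixed width. UTF-8 ... is not.
UTF-16 is not really fixed width: some Unicode code points take one 16-bit code unit and others need two. In the same way, UTF-8 is not fixed width: some Unicode code points take one 8-bit code unit, while others need two, three, or even four (but not five or six, despite the remark in Joel's article that mentions the possibility). UTF-32, on the other hand, is fixed width: every Unicode code point can be encoded in a single 32-bit code unit. (In fact, the largest possible Unicode code point is U+10FFFF, so Unicode is a 21-bit code set, though it does not use every possible 21-bit combination.)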
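The difference between the three encoding forms can be checked directly. The following is a minimal sketch in Python; the helper name `code_units` is made up for illustration, and the little-endian codec variants are used only so that no byte order mark is added to the output.

```python
def code_units(ch: str) -> dict:
    """Count the code units one code point needs in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }


# One sample code point for each UTF-8 length (1 to 4 bytes).
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_units(ch))

# U+0041 {'utf-8': 1, 'utf-16': 1, 'utf-32': 1}
# U+00E9 {'utf-8': 2, 'utf-16': 1, 'utf-32': 1}
# U+20AC {'utf-8': 3, 'utf-16': 1, 'utf-32': 1}
# U+1F600 {'utf-8': 4, 'utf-16': 2, 'utf-32': 1}
```

Only the UTF-32 column stays at 1 for every code point, which is the sense in which it is fixed width.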
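However, code points are not the same as characters, much less graphemes. The Unicode FAQ has a section, Characters and Combining Marks, that discusses graphemes and cites the glossary definition.
A better word for what end users think of as characters is grapheme (as defined in the Unicode glossary): a minimally distinctive unit of writing in the context of a particular writing system.
A grapheme is not necessarily a combining character sequence, and a combining character sequence is not necessarily a grapheme.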
Q: How are characters counted when measuring the length or position of a character in a string?
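A: Computing the length or position of a "character" in a Unicode string can be a little complicated, as there are four different approaches to doing so, plus the potential confusion caused by combining characters. The correct choice of which counting method to use depends on what is being counted and what the count or position is used for.
To address the question:
If you mean 'it may take several Unicode code points to get one complete character (grapheme) with its associated diacritics (combining marks and so on)', then yes, even UTF-32 is not necessarily fixed width, and there is no fixed-width encoding of Unicode.
UTF-32 uses a fixed-width encoding for each Unicode code point, but because it can take several code points to build one complete grapheme, even UTF-32 has no 1:1 mapping between code points and graphemes.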
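As a small illustration of that point, the letter "é" can be stored either as one precomposed code point or as a base letter followed by a combining mark. The sketch below uses Python's standard `unicodedata` module; the helper name `utf32_bytes` is hypothetical.

```python
import unicodedata


def utf32_bytes(text: str) -> int:
    """Size of the text in UTF-32 (little-endian, no byte order mark)."""
    return len(text.encode("utf-32-le"))


precomposed = "\u00e9"                                  # U+00E9, a single code point
decomposed = unicodedata.normalize("NFD", precomposed)  # U+0065 followed by U+0301

# Both strings display as the same single grapheme, yet:
print(len(precomposed), utf32_bytes(precomposed))  # 1 4
print(len(decomposed), utf32_bytes(decomposed))    # 2 8
```

The same user-perceived character takes 4 bytes in one form and 8 in the other, so the width per grapheme is not constant even in UTF-32.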
Of course, you can also find interesting stacks of characters in some comments on SO. For example:
@̮̘̮̜̤͓͓̓ͪ̓͆͗̑Ṷ̫̠̤̙̻͚̗ͭs̹͓̰̫͉̺̈̏̽̅̑ͩ̇̓̉e͖̝̦̦̿r͔̒̿̋̓n̹͖̥ͥͦͤ̍͊̏ä͇͖͚͖͊m̭͇͆͋̋͒e̫̠͇̰̦̹̫̠͇̰̦̹͗͋̓̿͒͗͋̓̿͒B̜̥̣̬̮͈͒̄ͪ͊l̮͉̣̟̪̪̿̍ͫ͋͐̑a̜̦̪͗͗̈ͣ͊ḫ̘̯͈̠̞̜̦̪͒ͯ͗͗̈ͣ͊ḫ̘̯͈̠̞͒ͯb̖̣͇̖̦̑ͬͭͥl͔͍͚͕̪̼͎ͧ̇̏ạ̖̪͚̯̊ͤͣͦͮ̌h̘͓͔̟͔͍̘͓͔̟͔͍̏ͣͦ̓̓̏ͣͦ̓̓b̙͍̼̜͍̹̬̬͎ͥ̓ͯḽ̜̟̾̅̆ͦͨa͇̰̝̺͊ͧͫ͛h̯̻͉̉̒̉̈ͥ.̖̩̭͇̭͔̹̈̇͐ͬͦͦͨ̾̇.͍̪̣ͬ.̞͍̥̪̺̤̣̜͆ͫ̈͑ͦ͑͑
Why/how do "Zalgo pings" work?
Ȩ̸҉̟͎͚̹͚̙̟̖x̨͙̰͕̖͉̼̜̦̟͈ą̷̘͕͈̹͓̣̮̼̣̠̹c̼͙̠̭̫̰͈͍̮͢͡ţ̢̛̠͇̬̖̟̺͈̻̣͙͈̼͍̘l̶̶͘͘y̭̖̰͚̞̣̗̳̠͕̻̼͡!̛̛͖̮͔͍̰͉͖̮͔͍̰͉͢͢O҉҉̣̜̺̪̳͕̖͔̠͙͎͕̙̦n̩͓͖̝̟̭͙͙͓͚̼͖͖͜͞ȩ̧̬̦̠̙̥͇͔̪̩͓͖̝̟̭͙͙͓͚̼͖͖͜͞ȩ̧̬̦̠̙̥͇͔̪i̴͞͏̩̤̹̗̖̰͎̖̘͓̗̯͚̞͖̥̻͝s͞҉͈̙̹̤̫͇͞҉͈̙̹̤̫͇e̷̪̭̯̼͓͎̹̠͖͔̪͈̦͈͍̭̩͠ņ͞҉̮̳͓͙͈̼͉̬͕͈̺͈̭̩̪o͇̗̠̠̯̕͢u̸̸̳̦̩̳̫̖̜̳̦̩̳̫̖̜h̸̛̩͚̮̤̖̹͙.̶̨̳̖̠̗̼̩͕͇͉͓̟̦͜͞
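What you see depends, of course, on the quality of the Unicode support in your browser (which in turn depends partly on the quality of the O/S support). I see different results on two different Macs running rather different versions of Firefox, even though they run the same base O/S version (10.10.4 Yosemite).
The second of these examples decodes from UTF-8 into the following sequence of Unicode code points; it is 'only' 700 bytes on disk: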
0xC8 0xA8 = U+0228
0xCC 0xB8 = U+0338
0xD2 0x89 = U+0489
0xCC 0x9F = U+031F
0xCD 0x8E = U+034E
0xCD 0x9A = U+035A
0xCC 0xB9 = U+0339
0xCD 0x9A = U+035A
0xCC 0x99 = U+0319
0xCC 0x9F = U+031F
0xCC 0x96 = U+0316
0x78 = U+0078
0xCC 0xA8 = U+0328
0xCD 0x99 = U+0359
0xCC 0xB0 = U+0330
0xCD 0x95 = U+0355
0xCC 0x96 = U+0316
0xCD 0x89 = U+0349
0xCC 0xBC = U+033C
0xCC 0x9C = U+031C
0xCC 0xB2 = U+0332
0xCC 0xA6 = U+0326
0xCC 0x9F = U+031F
0xCD 0x88 = U+0348
0xCC 0x81 = U+0301
0xCD 0x85 = U+0345
0xCD 0x85 = U+0345
0xC4 0x85 = U+0105
0xCC 0xB7 = U+0337
0xCC 0x98 = U+0318
0xCD 0x95 = U+0355
0xCD 0x88 = U+0348
0xCC 0xB9 = U+0339
0xCD 0x93 = U+0353
0xCC 0xA3 = U+0323
0xCC 0xAE = U+032E
0xCC 0xBC = U+033C
0xCC 0xA3 = U+0323
0xCC 0xA0 = U+0320
0xCC 0xB9 = U+0339
0xCC 0x81 = U+0301
0x63 = U+0063
0xCC 0xBC = U+033C
0xCD 0x99 = U+0359
0xCC 0xA0 = U+0320
0xCC 0xAD = U+032D
0xCC 0xAB = U+032B
0xCC 0xB0 = U+0330
0xCD 0x88 = U+0348
0xCD 0x8D = U+034D
0xCC 0xAE = U+032E
0xCD 0xA2 = U+0362
0xCD 0xA1 = U+0361
0xC5 0xA3 = U+0163
0xCC 0xA2 = U+0322
0xCC 0x9B = U+031B
0xCC 0xA0 = U+0320
0xCD 0x87 = U+0347
0xCC 0xAC = U+032C
0xCC 0x96 = U+0316
0xCC 0x9F = U+031F
0xCC 0xBA = U+033A
0xCD 0x88 = U+0348
0xCC 0xB2 = U+0332
0xCC 0xBB = U+033B
0xCC 0xA3 = U+0323
0xCC 0xB2 = U+0332
0xCD 0x99 = U+0359
0xCD 0x88 = U+0348
0xCC 0xBC = U+033C
0xCD 0x8D = U+034D
0xCC 0x98 = U+0318
0xCC 0xB1 = U+0331
0xCD 0x85 = U+0345
0x6C = U+006C
0xCC 0xB6 = U+0336
0xCD 0x98 = U+0358
0xE2 0x80 0x8C = U+200C
0xE2 0x80 0x8B = U+200B
0xCC 0xB7 = U+0337
0xCC 0xA8 = U+0328
0xCC 0xB2 = U+0332
0xCD 0x99 = U+0359
0xCD 0x96 = U+0356
0xCC 0xBB = U+033B
0xCC 0xB2 = U+0332
0xCC 0x97 = U+0317
0xCC 0xA6 = U+0326
0xCD 0x9A = U+035A
0xCD 0x99 = U+0359
0xCC 0xAE = U+032E
0xCD 0xA0 = U+0360
0x79 = U+0079
0xCC 0xAD = U+032D
0xCC 0x96 = U+0316
0xCC 0xB0 = U+0330
0xCD 0x9A = U+035A
0xCC 0x9E = U+031E
0xCC 0xA3 = U+0323
0xCC 0x97 = U+0317
0xCC 0xB3 = U+0333
0xCC 0xA0 = U+0320
0xCD 0x95 = U+0355
0xCC 0xBB = U+033B
0xCC 0xBC = U+033C
0xCD 0xA1 = U+0361
0xCD 0x85 = U+0345
0x21 = U+0021
0xCC 0x9B = U+031B
0xCD 0x96 = U+0356
0xCC 0xAE = U+032E
0xCD 0x94 = U+0354
0xCD 0x8D = U+034D
0xCC 0xB0 = U+0330
0xCD 0x89 = U+0349
0xCD 0xA2 = U+0362
0x20 = U+0020
0xCC 0xAD = U+032D
0xCC 0x99 = U+0319
0xCC 0x96 = U+0316
0xCD 0x94 = U+0354
0xCC 0xA9 = U+0329
0xCC 0x97 = U+0317
0xCC 0xA0 = U+0320
0xCD 0x95 = U+0355
0xCC 0xA6 = U+0326
0xCC 0xAC = U+032C
0xCD 0x93 = U+0353
0xCD 0x9E = U+035E
0xCD 0x9D = U+035D
0xCD 0x85 = U+0345
0x4F = U+004F
0xD2 0x89 = U+0489
0xD2 0x89 = U+0489
0xCC 0xA3 = U+0323
0xCC 0x9C = U+031C
0xCC 0xBA = U+033A
0xCC 0xAA = U+032A
0xCC 0xB3 = U+0333
0xCD 0x95 = U+0355
0xCC 0x96 = U+0316
0xCD 0x94 = U+0354
0xCC 0xA0 = U+0320
0xCD 0x99 = U+0359
0xCD 0x8E = U+034E
0xCD 0x95 = U+0355
0xCC 0x99 = U+0319
0xCC 0xA6 = U+0326
0xCD 0x85 = U+0345
0x6E = U+006E
0xCC 0xA9 = U+0329
0xCD 0x93 = U+0353
0xCD 0x96 = U+0356
0xCC 0x9D = U+031D
0xCC 0x9F = U+031F
0xCC 0xAD = U+032D
0xCD 0x99 = U+0359
0xCD 0x99 = U+0359
0xCD 0x93 = U+0353
0xCD 0x9A = U+035A
0xCC 0xBC = U+033C
0xCD 0x96 = U+0356
0xCD 0x96 = U+0356
0xCD 0x9C = U+035C
0xCD 0x9E = U+035E
0xC8 0xA9 = U+0229
0xCC 0xA7 = U+0327
0xCC 0xAC = U+032C
0xCC 0xB1 = U+0331
0xCC 0xA6 = U+0326
0xCC 0xA0 = U+0320
0xCC 0x99 = U+0319
0xCC 0xA5 = U+0325
0xCD 0x87 = U+0347
0xCD 0x94 = U+0354
0xCC 0xAA = U+032A
0xCC 0x81 = U+0301
0x20 = U+0020
0xD2 0x89 = U+0489
0xCC 0xB8 = U+0338
0xCC 0x97 = U+0317
0xCC 0xA6 = U+0326
0xCD 0x87 = U+0347
0xCC 0xB0 = U+0330
0xCC 0xAA = U+032A
0xCC 0xB0 = U+0330
0xCC 0xAD = U+032D
0xCC 0x98 = U+0318
0xCC 0xB9 = U+0339
0xCD 0x98 = U+0358
0xCD 0xA2 = U+0362
0x69 = U+0069
0xCC 0xB4 = U+0334
0xCD 0x9E = U+035E
0xCD 0x8F = U+034F
0xCC 0xA9 = U+0329
0xCC 0xA4 = U+0324
0xCC 0xB9 = U+0339
0xCC 0x97 = U+0317
0xCC 0x96 = U+0316
0xCC 0xB0 = U+0330
0xCD 0x8E = U+034E
0xCC 0x96 = U+0316
0xCC 0xB2 = U+0332
0xCC 0xB2 = U+0332
0xCC 0x98 = U+0318
0xCD 0x93 = U+0353
0xCC 0x97 = U+0317
0xCC 0xAF = U+032F
0xCD 0x9A = U+035A
0xCC 0x9E = U+031E
0xCD 0x96 = U+0356
0xCC 0xA5 = U+0325
0xCC 0xBB = U+033B
0xCD 0x9D = U+035D
0x73 = U+0073
0xCD 0x9E = U+035E
0xD2 0x89 = U+0489
0xCC 0xB2 = U+0332
0xCD 0x88 = U+0348
0xCC 0x99 = U+0319
0xCC 0xB9 = U+0339
0xCC 0xA4 = U+0324
0xCC 0xAB = U+032B
0xCD 0x87 = U+0347
0x20 = U+0020
0xCD 0x9A = U+035A
0xCC 0xAD = U+032D
0xCD 0x8E = U+034E
0xCD 0x89 = U+0349
0xCC 0xA0 = U+0320
0xCC 0xBA = U+033A
0xCD 0x89 = U+0349
0xCC 0xAE = U+032E
0xCC 0x9E = U+031E
0xCC 0xBB = U+033B
0xCC 0xA3 = U+0323
0xCC 0xB0 = U+0330
0xCC 0xBA = U+033A
0xCC 0x96 = U+0316
0xCD 0x96 = U+0356
0xCC 0x80 = U+0300
0xCC 0x81 = U+0301
0xCD 0xA2 = U+0362
0xCD 0x9E = U+035E
0x65 = U+0065
0xCC 0xB7 = U+0337
0xCC 0xAA = U+032A
0xCC 0xAD = U+032D
0xCC 0xAF = U+032F
0xCC 0xBC = U+033C
0xCD 0x93 = U+0353
0xCD 0x8E = U+034E
0xCC 0xB9 = U+0339
0xCC 0xA0 = U+0320
0xCD 0x96 = U+0356
0xCC 0xB2 = U+0332
0xCD 0x94 = U+0354
0xCC 0xAA = U+032A
0xCD 0x88 = U+0348
0xCC 0xA6 = U+0326
0xCD 0x88 = U+0348
0xCC 0xB1 = U+0331
0xCD 0x8D = U+034D
0xCC 0xAD = U+032D
0xCC 0xA9 = U+0329
0xCD 0xA0 = U+0360
0xC5 0x86 = U+0146
0xCD 0x9E = U+035E
0xD2 0x89 = U+0489
0xCC 0xAE = U+032E
0xCC 0xB3 = U+0333
0xCD 0x93 = U+0353
0xCD 0x99 = U+0359
0xCD 0x88 = U+0348
0xCC 0xBC = U+033C
0xCD 0x89 = U+0349
0xCC 0xAC = U+032C
0xCD 0x95 = U+0355
0xCD 0x88 = U+0348
0xCC 0xBA = U+033A
0xCD 0x88 = U+0348
0xCC 0xAD = U+032D
0xCC 0xA9 = U+0329
0xCC 0xAA = U+032A
0x6F = U+006F
0xCD 0x87 = U+0347
0xCC 0x97 = U+0317
0xCC 0xB1 = U+0331
0xCC 0xA0 = U+0320
0xCC 0xB1 = U+0331
0xCC 0xA0 = U+0320
0xCC 0xAF = U+032F
0xCC 0x95 = U+0315
0xCD 0xA2 = U+0362
0x75 = U+0075
0xCC 0xB8 = U+0338
0xCC 0xB3 = U+0333
0xCC 0xA6 = U+0326
0xCC 0xA9 = U+0329
0xCC 0xB3 = U+0333
0xCC 0xAB = U+032B
0xCC 0x96 = U+0316
0xCC 0x9C = U+031C
0xCD 0x85 = U+0345
0xE2 0x80 0x8C = U+200C
0xE2 0x80 0x8B = U+200B
0xC7 0xB5 = U+01F5
0xCC 0xA2 = U+0322
0xCC 0xB2 = U+0332
0xCC 0xA3 = U+0323
0xCD 0x8E = U+034E
0xCC 0xAE = U+032E
0xCC 0xAE = U+032E
0xCC 0xBC = U+033C
0xCC 0xAB = U+032B
0xCC 0xA5 = U+0325
0xCC 0xA0 = U+0320
0xCD 0x99 = U+0359
0xCC 0xB1 = U+0331
0xCC 0x9D = U+031D
0xCC 0x98 = U+0318
0xCD 0x95 = U+0355
0xCD 0x8E = U+034E
0xCC 0xB3 = U+0333
0xCC 0x9C = U+031C
0xCC 0xB2 = U+0332
0xCC 0x96 = U+0316
0x68 = U+0068
0xCC 0xB8 = U+0338
0xCC 0x9B = U+031B
0xCC 0xA9 = U+0329
0xCD 0x9A = U+035A
0xCC 0xAE = U+032E
0xCC 0xA4 = U+0324
0xCC 0x96 = U+0316
0xCC 0xB9 = U+0339
0xCD 0x99 = U+0359
0x2E = U+002E
0xCC 0xB6 = U+0336
0xCC 0xA8 = U+0328
0xCC 0xB3 = U+0333
0xCC 0x96 = U+0316
0xCC 0xA0 = U+0320
0xCC 0x97 = U+0317
0xCC 0xBC = U+033C
0xCC 0xA9 = U+0329
0xCD 0x95 = U+0355
0xCD 0x87 = U+0347
0xCD 0x89 = U+0349
0xCD 0x93 = U+0353
0xCC 0x9F = U+031F
0xCC 0xA6 = U+0326
0xCD 0x9C = U+035C
0xCD 0x9E = U+035E
0xCD 0x85 = U+0345
0x0A = U+000A
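A listing in this format can be generated with a few lines of Python. This is only a sketch of one way to do it, not necessarily how the listing above was produced, and the function name `dump_utf8` is made up for illustration.

```python
def dump_utf8(data: bytes) -> list[str]:
    """Return one 'bytes = code point' line per decoded code point."""
    lines = []
    for ch in data.decode("utf-8"):
        # Re-encode each code point to recover the bytes it came from.
        raw = " ".join(f"0x{b:02X}" for b in ch.encode("utf-8"))
        lines.append(f"{raw} = U+{ord(ch):04X}")
    return lines


# The first two code points of the listing, followed by the letter x.
for line in dump_utf8(b"\xC8\xA8\xCC\xB8x"):
    print(line)

# 0xC8 0xA8 = U+0228
# 0xCC 0xB8 = U+0338
# 0x78 = U+0078
```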
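Deciphering which parts are graphemes is tricky, but with all the stacked characters it is clear that there is no fixed amount of data per grapheme. There is also no sane way to make a fixed-width encoding of Unicode work per grapheme because, as the 'Zalgo' examples show, combining marks can be strung together in essentially arbitrary sequences.
The first grapheme in the second 'Zalgo' example consists of: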
0xC8 0xA8 = U+0228 LATIN CAPITAL LETTER E WITH CEDILLA
0xCC 0xB8 = U+0338 COMBINING LONG SOLIDUS OVERLAY
0xD2 0x89 = U+0489 CYRILLIC COMBINING MILLIONS SIGN
0xCC 0x9F = U+031F COMBINING PLUS SIGN BELOW
0xCD 0x8E = U+034E COMBINING UPWARDS ARROW BELOW
0xCD 0x9A = U+035A COMBINING DOUBLE RING BELOW
0xCC 0xB9 = U+0339 COMBINING RIGHT HALF RING BELOW
0xCD 0x9A = U+035A COMBINING DOUBLE RING BELOW
0xCC 0x99 = U+0319 COMBINING RIGHT TACK BELOW
0xCC 0x9F = U+031F COMBINING PLUS SIGN BELOW
0xCC 0x96 = U+0316 COMBINING GRAVE ACCENT BELOW
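The next code point is U+0078 LATIN SMALL LETTER X, the start of a new letter. Within that list, some combining marks appear more than once (U+031F and U+035A each appear twice).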
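The split into "a base letter plus the marks stacked on it" can be approximated in Python with the standard `unicodedata` module. This is a rough sketch, not the full grapheme cluster algorithm from Unicode Standard Annex #29: the helper `rough_clusters` is hypothetical and simply attaches every code point whose general category starts with "M" (a mark) to the code point before it.

```python
import unicodedata


def rough_clusters(text: str) -> list[str]:
    """Group each base code point with the mark code points that follow it."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # Mn/Mc/Me: attach to the current cluster
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters


# The first 13 code points of the second 'Zalgo' example.
sample = ("\u0228\u0338\u0489\u031F\u034E\u035A\u0339\u035A\u0319\u031F\u0316"
          "x\u0328")

for cluster in rough_clusters(sample):
    size = len(cluster.encode("utf-32-le"))
    print(f"{len(cluster)} code points, {size} bytes in UTF-32")
    for ch in cluster:
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

For this sample the first cluster holds 11 code points (44 bytes in UTF-32) and the second holds 2 (8 bytes), so the amount of data per grapheme varies even though every code point is 4 bytes wide.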
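Answer 1 (score: 0)
UTF-32 is a fixed-width encoding and, incidentally, the only Unicode encoding that maps DWORD values directly to Unicode code points. There are limits on the values, though: the highest value is 0x10FFFF, and the entire high-surrogate and low-surrogate ranges are invalid in UTF-32.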
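Those two restrictions are easy to express in code. The sketch below is a hand-rolled illustration rather than a replacement for a real codec; the names `is_valid_utf32_value` and `decode_utf32le` are hypothetical, and little-endian byte order without a byte order mark is assumed.

```python
import struct


def is_valid_utf32_value(value: int) -> bool:
    """True if the 32-bit value is a code point that UTF-32 may carry."""
    in_range = 0 <= value <= 0x10FFFF
    is_surrogate = 0xD800 <= value <= 0xDFFF  # high and low surrogate ranges
    return in_range and not is_surrogate


def decode_utf32le(data: bytes) -> str:
    """Map each little-endian DWORD directly to one code point."""
    if len(data) % 4:
        raise ValueError("UTF-32 data must be a multiple of 4 bytes")
    values = struct.unpack(f"<{len(data) // 4}I", data)
    for value in values:
        if not is_valid_utf32_value(value):
            raise ValueError(f"invalid UTF-32 value 0x{value:08X}")
    return "".join(chr(value) for value in values)


print(decode_utf32le(b"\x41\x00\x00\x00\x00\xF6\x01\x00") == "A\U0001F600")  # True
print(is_valid_utf32_value(0xD800))    # False (surrogate)
print(is_valid_utf32_value(0x110000))  # False (above U+10FFFF)
```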