Question

In Python 2.7:

In [2]: utf8_str = '\xf0\x9f\x91\x8d'
In [3]: print(utf8_str)
👍
In [4]: unicode_str = utf8_str.decode('utf-8')
In [5]: print(unicode_str)
👍
In [6]: unicode_str
Out[6]: u'\U0001f44d'
In [7]: len(unicode_str)
Out[7]: 2

Since unicode_str contains only one Unicode code point (0x0001f44d), why does len(unicode_str) return 2 instead of 1?

Answer 1

Your Python binary was compiled with UCS-2 support (a narrow build), and internally anything outside the BMP (Basic Multilingual Plane) is represented using a surrogate pair.

This means that such code points show up as 2 characters when you ask for the length.
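To make the surrogate pair concrete, here is a minimal sketch of the standard UTF-16 arithmetic that splits a code point above U+FFFF into two 16-bit units. The function name to_surrogate_pair is made up for illustration; it is not something Python exposes.

```python
def to_surrogate_pair(code_point):
    """Split a code point outside the BMP into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F44D)
print("%04x %04x" % (high, low))  # d83d dc4d
```

A narrow build stores these two units for U+1F44D, and len() counts units rather than code points, hence the 2.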

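If this matters, you will have to recompile your Python binary to use UCS-4 (./configure --enable-unicode=ucs4 will enable it), or upgrade to Python 3.3 or newer, where Python's Unicode support was overhauled to use a variable-width Unicode type that switches between ASCII, UCS-2 and UCS-4 as required by the code points contained.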
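On Python 3.3 or newer you can observe the variable-width storage indirectly. The sketch below assumes CPython, where sys.getsizeof reflects the internal string layout; the exact byte counts vary by version and platform, so it only compares them. The variable names are illustrative.

```python
import sys

# Three strings of equal length whose widest character needs 1, 2 and 4 bytes.
ascii_text = "a" * 100
bmp_text = "\u20ac" * 100        # EURO SIGN, inside the BMP
astral_text = "\U0001f44d" * 100  # THUMBS UP SIGN, outside the BMP

lengths = [len(s) for s in (ascii_text, bmp_text, astral_text)]
sizes = [sys.getsizeof(s) for s in (ascii_text, bmp_text, astral_text)]

print(lengths)  # every string reports 100 code points
print(sizes)    # memory use grows with the widest character stored
```

The length is the same in all three cases; only the storage per code point changes.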

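On Python 2.7 and 3.0 - 3.2, you can detect which kind of build you have by inspecting the sys.maxunicode value. It will be 2^16-1 == 65535 == 0xFFFF for a narrow UCS-2 build, and 1114111 == 0x10FFFF for a wide UCS-4 build. In Python 3.3 and up, it is always set to 1114111.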
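If you want that check inside a program, something along these lines should work on both Python 2 and 3. The helper name unicode_build is hypothetical, and the optional argument exists only so the function can be exercised without a narrow interpreter at hand.

```python
import sys


def unicode_build(maxunicode=None):
    """Return 'narrow' for a UCS-2 build and 'wide' otherwise."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # 0xFFFF is the largest value a narrow (UCS-2) build reports.
    return "narrow" if maxunicode == 0xFFFF else "wide"


print(unicode_build())
```

On Python 3.3 and later this always reports "wide", since sys.maxunicode is fixed at 0x10FFFF there.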

Demo:

# Narrow build
$ bin/python -c 'import sys; print sys.maxunicode, len(u"\U0001f44d"), list(u"\U0001f44d")'
65535 2 [u'\ud83d', u'\udc4d']
# Wide build
$ python -c 'import sys; print sys.maxunicode, len(u"\U0001f44d"), list(u"\U0001f44d")'
1114111 1 [u'\U0001f44d']
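If recompiling or upgrading is not an option, one workaround is to count code points yourself and treat a high surrogate followed by a low surrogate as a single character. This is only a sketch; code_point_len is a name invented here, not a built-in.

```python
def code_point_len(text):
    """Count code points, treating each surrogate pair as one character."""
    count = 0
    prev_was_high = False
    for unit in text:
        value = ord(unit)
        if prev_was_high and 0xDC00 <= value <= 0xDFFF:
            # Second half of a pair whose first half was already counted.
            prev_was_high = False
            continue
        count += 1
        prev_was_high = 0xD800 <= value <= 0xDBFF
    return count


# The second literal spells out the pair that a narrow build stores.
print(code_point_len(u"\U0001f44d"), code_point_len(u"\ud83d\udc4d"))  # 1 1
```

On a wide build or on Python 3.3+ the first call simply agrees with len(); the pairing logic only matters where the string really holds two surrogate units.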
