Question

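As a workaround for MySQL truncating unicode strings when it encounters "high" (ordinal >= 2^16) code points, I have been using a small Python method that steps through the string (strings are sequences, remember), calls ord() on each character, and preempts the truncation by substituting something else or dropping the code point outright. This has worked as expected on many machines running Python 2.7.3 (Ubuntu 12.04 LTS, some Centos 6, a mix of 32-bit and 64-bit CPUs, which has not mattered so far).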

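I noticed that this breaks on a Python 2.7.6 install. ASCII characters and "low" code points (ordinal < 2^16) behave as before. But high code points (>= 2^16) behave very strangely: Python 2.7.6 seems to treat each of them as two code points. Here is a bare-bones test case: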

### "good" machine, Python2.7.3
$ uname -a && echo $LANG
Linux *** 3.2.0-60-virtual #91-Ubuntu SMP Wed Feb 19 04:13:28 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
en_US.UTF-8
$ python2.7
Python 2.7.3 (default, Feb 27 2014, 19:58:35) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> utest = u'a\u0395\U0001f30e'    # three chars: ascii, "low" codepoint, "high" codepoint
>>> utest.__class__
<type 'unicode'>
>>> len(utest), hash(utest)
(3, 1453079728409075183)
>>> list(utest)        # split into list of single chars
[u'a', u'\u0395', u'\U0001f30e']
>>> utest[2]   # trying to extract third char (high codepoint)
u'\U0001f30e'
>>> len(utest[2])
1
>>> "%x" % ord(utest[2])
'1f30e'

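This is the expected behavior. I initialize a unicode string with three characters. Python says it is three characters long, and it can "address" the third character, returning the single expected high code point. If I take the ordinal of that code point, I get the same number as in the original escape sequence.

Now on Python 2.7.6: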

### "bad" machine, Python 2.7.6
$ uname -a && echo $LANG
Linux *** 2.6.32-431.5.1.el6.x86_64 #1 SMP Wed Feb 12 00:41:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
en_US.UTF-8
$ python2.7
Python 2.7.6 (default, Jan 29 2014, 20:05:36)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> utest = u'a\u0395\U0001f30e'
>>> utest.__class__
<type 'unicode'>
>>> len(utest), hash(utest)    # !!!
(4, -2836525916470507760)

First difference: Python 2.7.6 says the length of utest is 4. The hash is different, too. Next surprise:

>>> list(utest)                # !!!
[u'a', u'\u0395', u'\ud83c', u'\udf0e']

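Not only does the length behave strangely; splitting into single characters is even stranger, because the two "halves" of the high code point turn into two low code points that have no obvious numeric relationship (at least to me) to the original code point.

Addressing that code point by sequence index shows the same breakage: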

>>> utest[2]
u'\ud83c'

To get the original high code point back, I now have to use a two-character slice:

>>> utest[2:4]
u'\U0001f30e'

However, in case it is not obvious, Python 2.7.6 still treats this internally as two code points. I have no way to get a single ordinal out of it.

>>> len(utest[2:4])
2
>>> "%x" % ord(utest[2:4])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found

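So, what to do? My code depends on the ordinals of the code points in a unicode string. If one code point is sometimes really two code points, my ordinals become meaningless and my code can no longer do its job.

Is there a reason for this behavior? Is it an intentional change? Is there some configuration knob I can turn, inside Python or at the system level, to restore the old behavior? A monkey patch? I don't know where to look.

Unfortunately, I can't even narrow this down to an exact minor version. We have lots of 2.7.3 installs, some 2.7.1, and a few 2.7.6. No 2.7.4 / 2.7.5. All I can say is that I have never run into this problem on any 2.7.3 install.

Bonus info: encoding the string to utf8 produces exactly the same result on both Python versions (same characters, same length, same hash). Decoding the encoded utf8 again still puts me back at square one (i.e. it is not a workaround; the behavior still diverges in unicode space).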

Answer 1

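You are experiencing what are called "surrogate pairs". These only occur on narrow builds of Python, where code points are stored internally as UTF-16. You can confirm which build you have by checking sys.maxunicode (on a narrow build it will be 2 ** 16 - 1).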
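
To make the relationship between the two "halves" and the original code point concrete, here is a minimal sketch of the standard UTF-16 surrogate arithmetic. The helper names `to_surrogate_pair` and `from_surrogate_pair` are made up for illustration and are not part of the standard library; only `sys.maxunicode` is a real attribute.

```python
import sys

def to_surrogate_pair(code_point):
    """Split a code point >= 0x10000 into its UTF-16 (high, low) surrogates."""
    offset = code_point - 0x10000          # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# True on a narrow build, False on a wide build (and on Python 3.3+).
is_narrow_build = sys.maxunicode == 0xFFFF

print("%x %x" % to_surrogate_pair(0x1F30E))          # d83c df0e
print("%x" % from_surrogate_pair(0xD83C, 0xDF0E))    # 1f30e
```

So on a narrow build, `from_surrogate_pair(ord(utest[2]), ord(utest[3]))` should give back 0x1f30e for the test string in the question.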

Another good read is PEP 393, which puts this issue to rest... unfortunately only for Python 3.3+.

Edit: Googled a workaround. Full credit to @dan04

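import struct

def code_points(text):
    # UTF-32 stores every code point in exactly 4 bytes, so surrogate pairs
    # on a narrow build are merged back into single code points.
    utf32 = text.encode('UTF-32LE')
    # Unpack the bytes as little-endian unsigned 32-bit integers.
    return struct.unpack('<{}I'.format(len(utf32) // 4), utf32)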

>>> len(utest)
4
>>> len(code_points(utest))
3

If you only care about the length you can do len(utest.encode('UTF-32LE')) // 4, but it looks like you want to do more, so the function above may be useful.
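
For the use case in the question (replacing or dropping high code points before they reach MySQL), the same UTF-32 round trip can be taken one step further. This is only a sketch: `replace_high_code_points` and its `replacement` parameter are hypothetical names, and U+FFFD as the default substitute is my assumption, not something from the question. The string is rebuilt by packing back to UTF-32 rather than with unichr(), since unichr() rejects values above 0xFFFF on a narrow build.

```python
import struct

def replace_high_code_points(text, replacement=0xFFFD):
    """Return text with every code point >= 0x10000 replaced.

    replacement is a code point number below 0x10000, or None to drop
    the high code points entirely.
    """
    utf32 = text.encode('UTF-32LE')
    points = struct.unpack('<{}I'.format(len(utf32) // 4), utf32)
    cleaned = []
    for cp in points:
        if cp < 0x10000:
            cleaned.append(cp)           # keep ASCII and "low" code points
        elif replacement is not None:
            cleaned.append(replacement)  # substitute for a "high" code point
    packed = struct.pack('<{}I'.format(len(cleaned)), *cleaned)
    return packed.decode('UTF-32LE')

cleaned = replace_high_code_points(u'a\u0395\U0001f30e')
print([hex(ord(c)) for c in cleaned])    # ['0x61', '0x395', '0xfffd']
```

Because the filtering happens on UTF-32 code points, the result should be the same on narrow and wide builds.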

Python 2.7.6 splits a single "high" unicode code point in two
