Question

Is there a better way to do this?

$ python
Python 2.7.9 (default, Jul 16 2015, 14:54:10)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-55)] on linux2

Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.sub(u'[\U0001d300-\U0001d356]', "", "")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fast/services/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/home/fast/services/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

Answer 1

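Python narrow and wide builds (Python versions below 3.3)

The error indicates that you are using a "narrow" (UCS-2) build, which only supports Unicode code points up to 65535 as a single "Unicode character"¹. Characters whose code points are above 65535 are represented as surrogate pairs, which means that the Unicode string u'\U0001d300' consists of two "Unicode characters" in a narrow build.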

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> import sys; sys.maxunicode
65535
>>> len(u'\U0001d300')
2
>>> [hex(ord(i)) for i in u'\U0001d300']
['0xd834', '0xdf00']
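The two values above follow from the standard UTF-16 surrogate arithmetic. A minimal sketch of that calculation (the helper name to_surrogate_pair is only illustrative, not part of any library):

```python
def to_surrogate_pair(code_point):
    """Split an astral code point (U+10000..U+10FFFF) into its
    UTF-16 high and low surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point: %#x" % code_point)
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

print([hex(unit) for unit in to_surrogate_pair(0x1D300)])  # ['0xd834', '0xdf00']
```

This is the same pair that the narrow build reports for u'\U0001d300'.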

In a "wide" (UCS-4) build, every code point up to 1114111 (U+10FFFF) is recognized as a Unicode character, so the Unicode string u'\U0001d300' consists of exactly one "Unicode character" / Unicode code point.

Python 2.6.6 (r266:84292, May  1 2012, 13:52:17)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
>>> import sys; sys.maxunicode
1114111
>>> len(u'\U0001d300')
1
>>> [hex(ord(i)) for i in u'\U0001d300']
['0x1d300']

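¹ I use "Unicode character" (in quotes) to refer to one character in a Python Unicode string, not one Unicode code point. The number of "Unicode characters" in a string is its len(). In a "narrow" build, one "Unicode character" is a 16-bit code unit of UTF-16, so one astral character appears as two "Unicode characters". In a "wide" build, one "Unicode character" always corresponds to one Unicode code point.

Matching astral plane characters with regex

Wide build

The regex in the question compiles correctly in a "wide" build: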

Python 2.6.6 (r266:84292, May  1 2012, 13:52:17)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
>>> import re; re.compile(u'[\U0001d300-\U0001d356]', re.DEBUG)
in
  range (119552, 119638)
<_sre.SRE_Pattern object at 0x7f9f110386b8>

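Narrow build

However, the same regex fails to compile in a "narrow" build, because the engine does not recognize surrogate pairs. It treats \ud834 as a single character, then tries to create a character range from \udf00 to \ud834 and fails, since the start of that range is greater than its end.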

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> [hex(ord(i)) for i in u'[\U0001d300-\U0001d356]']
['0x5b', '0xd834', '0xdf00', '0x2d', '0xd834', '0xdf56', '0x5d']
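To see why this sequence of code units is rejected, the sketch below spells out the surrogates explicitly, so the pattern looks the same to the engine as it does on a narrow build. It assumes either a narrow build or Python 3.3+, where '\ud834\udf00' in a literal stays as two separate code units; fails_to_compile is a hypothetical helper name.

```python
import re

def fails_to_compile(pattern):
    """Return True if re.compile rejects the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return True
    return False

# The character class as a narrow build sees it:
#   [  \ud834  \udf00 - \ud834  \udf56  ]
# i.e. the literal \ud834, the range \udf00-\ud834, and the literal \udf56.
narrow_view = u'[\ud834\udf00-\ud834\udf56]'
print(fails_to_compile(narrow_view))

# The offending part in isolation: a range whose start exceeds its end.
print(fails_to_compile(u'[\udf00-\ud834]'))
```

Both calls should report a failure, for the same reason that a class such as [z-a] is rejected.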

The workaround is to use the same method as is done in ECMAScript: construct the regex so that it matches the surrogates that represent the code points.

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> import re; re.compile(u'\ud834[\udf00-\udf56]', re.DEBUG)
literal 55348
in
  range (57088, 57174)
<_sre.SRE_Pattern object at 0x6ffffe52210>
>>> input =  u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
>>> input
u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
>>> re.sub(u'\ud834[\udf00-\udf56]', '', input)
u'Sample . Another . Leave alone \U00011000'

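Using regexpu to derive the astral plane regex for a Python narrow build

Since the construct for matching astral plane characters in a Python narrow build is the same as in ES5, you can use regexpu, a tool that converts RegExp literals in ES6 to ES5, to do the conversion for you.

Just paste the equivalent regex in ES6 (note the u flag and the \u{hh...h} syntax):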

/[\u{1d300}-\u{1d356}]/u

and you get back a regex that can be used in both a Python narrow build and ES5:

/(?:\uD834[\uDF00-\uDF56])/

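If you want to use the regex in Python, take care to remove the / delimiters of the JavaScript RegExp literal.

The tool is especially useful when the range spans multiple high surrogates (U+D800 to U+DBFF). For example, if we have to match the character range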

/[\u{105c0}-\u{1cb40}]/u

the equivalent regex in a Python narrow build and ES5 is

/(?:\uD801[\uDDC0-\uDFFF]|[\uD802-\uD831][\uDC00-\uDFFF]|\uD832[\uDC00-\uDF40])/

which is rather complex and error-prone to derive by hand.
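For a single contiguous astral range, the derivation can also be sketched directly in Python. This is not how regexpu is implemented; it is only a rough illustration of the splitting logic, and the function name astral_range_to_narrow_regex is made up for this example. It returns the regex source with \uXXXX escapes, in the same shape as the output above. Python 3.3+ `re` understands those escapes directly; in a Python 2 narrow build you would write them inside a u'' literal instead, as in the session earlier.

```python
def astral_range_to_narrow_regex(start, end):
    """Return regex source matching code points start..end (both astral)
    as UTF-16 surrogate pairs, using \\uXXXX escapes."""
    if not 0x10000 <= start <= end <= 0x10FFFF:
        raise ValueError("expected an astral range with start <= end")

    def split(code_point):
        offset = code_point - 0x10000
        return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

    def esc(unit):
        return '\\u%04X' % unit

    def cls(first, last):
        # A single unit needs no character class.
        if first == last:
            return esc(first)
        return '[%s-%s]' % (esc(first), esc(last))

    hi_start, lo_start = split(start)
    hi_end, lo_end = split(end)

    if hi_start == hi_end:
        parts = [esc(hi_start) + cls(lo_start, lo_end)]
    else:
        parts = []
        first_full = lo_start == 0xDC00   # first high surrogate fully covered
        last_full = lo_end == 0xDFFF      # last high surrogate fully covered
        mid_start = hi_start if first_full else hi_start + 1
        mid_end = hi_end if last_full else hi_end - 1
        if not first_full:
            parts.append(esc(hi_start) + cls(lo_start, 0xDFFF))
        if mid_start <= mid_end:
            parts.append(cls(mid_start, mid_end) + cls(0xDC00, 0xDFFF))
        if not last_full:
            parts.append(esc(hi_end) + cls(0xDC00, lo_end))
    return '(?:%s)' % '|'.join(parts)

print(astral_range_to_narrow_regex(0x1d300, 0x1d356))
# (?:\uD834[\uDF00-\uDF56])
print(astral_range_to_narrow_regex(0x105c0, 0x1cb40))
# (?:\uD801[\uDDC0-\uDFFF]|[\uD802-\uD831][\uDC00-\uDFFF]|\uD832[\uDC00-\uDF40])
```

The sketch handles only one range at a time; a full character class with several ranges, negation, or BMP characters mixed in is where a dedicated tool such as regexpu is the safer choice.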

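Python 3.3 and above

Python 3.3 implements PEP 393, which eliminates the distinction between narrow and wide builds; from that version on, Python behaves like a wide build. This removes the problem in the question altogether.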
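As a quick check, assuming a Python 3.3+ interpreter, the original character class from the question works unchanged (the variable names here are only illustrative):

```python
import re
import sys

# Since PEP 393 there is a single build flavour: every code point is one character.
print(sys.maxunicode == 0x10FFFF)
print(len('\U0001d300'))

astral_range = re.compile('[\U0001d300-\U0001d356]')
text = 'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
cleaned = astral_range.sub('', text)
print(cleaned == 'Sample . Another . Leave alone \U00011000')
```

No surrogate handling is needed, because the engine sees each astral character as a single element of the string.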

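Compatibility issues

While it is possible to work around the problem and match astral plane characters in a Python narrow build, going forward it is best to change the execution environment by using a Python wide build, or to port the code to Python 3.3 and above.

The regex code for a narrow build is hard for the average programmer to read and maintain, and it has to be completely rewritten when porting to Python 3.

Reference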
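If the code has to run on both kinds of interpreter for a while, one possible stopgap is to choose the pattern once, based on sys.maxunicode, and keep the narrow-build variant in a single place. This is only a sketch; the names SYMBOLS and strip_symbols are hypothetical.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build or Python 3.3+: the range can be written directly.
    _SOURCE = u'[\U0001d300-\U0001d356]'
else:
    # Narrow build: match the surrogate pairs explicitly.
    _SOURCE = u'\ud834[\udf00-\udf56]'

SYMBOLS = re.compile(_SOURCE)

def strip_symbols(text):
    """Remove characters in U+1D300..U+1D356 from a Unicode string."""
    return SYMBOLS.sub(u'', text)

sample = u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
print(strip_symbols(sample) == u'Sample . Another . Leave alone \U00011000')
```

Once the narrow build is no longer supported, the else branch can simply be deleted.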

How to find out if Python is compiled with UCS-2 or UCS-4?

How can I represent this regex so that it does not produce a "bad character range" error?
