Question

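I'm developing a program that should reject any code point above U+10FFFF. This seems simple enough, except that I can't figure out how to express such a range of code points in my regular expression. I'd like to do something like this: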

valid_character = re.compile(u'[\u0000-\u10FFFF]')

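and then handle anything that doesn't match appropriately. However, \u seems to recognize only the first four hex digits, i.e. 10FF. Is there another way to express this code point range, or to handle this situation?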

This site suggests using u"\U0010FFFF", but that doesn't seem to work either.

>>> ord(u'\U0010FFFF')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
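
For reference, here is a minimal sketch of how the two escape forms parse. It assumes Python 3.3 or later, where a str is a sequence of code points; the traceback above is consistent with a narrow Python 2 build, which stores U+10FFFF as a surrogate pair of length 2. The variable names are illustrative only.

```python
import re

short_escape = "\u10FFFF"   # \u takes exactly four hex digits: U+10FF, then "F", "F"
long_escape = "\U0010FFFF"  # \U takes exactly eight hex digits: a single code point

print(len(short_escape), len(long_escape))
print(hex(ord(long_escape)))

# The whole Unicode range as a character class. The escapes are left to the
# re module (raw string), which understands \u and \U since Python 3.3.
valid_character = re.compile(r"[\u0000-\U0010FFFF]")
print(bool(valid_character.fullmatch(long_escape)))

# chr() refuses anything past the last code point.
try:
    chr(0x110000)
except ValueError as exc:
    print("rejected:", exc)
```

Under that assumption the character class matches every character a str can hold, so a regex check of this kind cannot fail; the rejection has to happen when the bytes are decoded.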

Answer 1

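If you decode a file as UTF-8 and its contents violate the spec, Python raises an error, so the answer to the question is "just open the file and decode it as UTF-8". If a character is invalid, Python will take care of it.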

Example:

>>> b'\xf4\x8f\xbf\xbf'.decode('utf8')
u'\U0010ffff'

# UTF-8 equivalent to \U00110000...
>>> len(b'\xf4\x90\x80\x80'.decode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\dev\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid continuation byte
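
If you want to handle the failure rather than let it propagate, the same idea can be wrapped in a small helper. This is only a sketch, assuming Python 3 and input held as bytes; the function name first_invalid_utf8_offset is made up for illustration.

```python
def first_invalid_utf8_offset(data):
    """Return None if data is valid UTF-8, else the byte offset where decoding fails."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        return exc.start      # index of the first byte of the offending sequence
    return None


# U+10FFFF is the last valid code point, so this decodes cleanly.
print(first_invalid_utf8_offset(b"\xf4\x8f\xbf\xbf"))

# These bytes would encode U+110000, which the decoder rejects.
print(first_invalid_utf8_offset(b"ok\xf4\x90\x80\x80"))
```

Returning the offset instead of re-raising lets the caller decide whether to report the position, skip the input, or decode again with errors="replace".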

Recognizing code points above U+10FFFF
