Question

Dealing with non-ASCII characters in Python is really confusing. Can anyone explain?

I'm trying to read a plain text file and replace all non-alphabetic characters with spaces.

I have a list of characters:

ignorelist = ('!', '-', '_', '(', ')', ',', '.', ':', ';', '"', '\'', '?', '#', '@', '$', '^', '&', '*', '+', '=', '{', '}', '[', ']', '\\', '|', '<', '>', '/', u'—')

For each token I get, I replace any of those characters in the token by calling

    for punc in ignorelist:
        token = token.replace(punc, ' ')

Note that there is a non-ASCII character at the end of ignorelist: u'—'

Whenever my code encounters that character, it crashes and says:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position

I tried declaring the encoding by adding # -*- coding: utf-8 -*- at the top of the file, but it still doesn't work. Does anyone know why? Thanks!

Answer 1

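Your file input isn't being decoded as UTF-8, so what you read is a plain byte str. When the replace hits that unicode character, your input barfs because Python treats it as ASCII.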

Try reading the file with this instead.

import codecs
f = codecs.open("test", "r", "utf-8")
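
As a rough sketch of how that could fit together with the replacement loop from the question: the helper name `read_tokens`, the shortened `IGNORELIST`, and the sample file are all made up for illustration. It uses `io.open` rather than `codecs.open` only because `io.open` behaves the same way on Python 2.6+ and Python 3; the idea (decode while reading, so every token is already unicode) is the same.

```python
import io
import os
import tempfile

# Shortened, all-unicode version of the ignorelist from the question.
IGNORELIST = (u'!', u',', u'.', u'\u2014')  # u'\u2014' is the em dash


def read_tokens(path, encoding='utf-8'):
    """Hypothetical helper: decode the file while reading, then clean each token."""
    tokens = []
    with io.open(path, 'r', encoding=encoding) as f:
        for line in f:
            for token in line.split():
                for punc in IGNORELIST:
                    token = token.replace(punc, u' ')
                tokens.append(token)
    return tokens


# Write a small UTF-8 sample file so the sketch is self-contained.
fd, sample_path = tempfile.mkstemp()
os.close(fd)
with io.open(sample_path, 'w', encoding='utf-8') as f:
    f.write(u'hello\u2014world, ok!\n')

result = read_tokens(sample_path)
os.remove(sample_path)
```

Because the tokens and the items in `IGNORELIST` are both unicode here, no implicit ASCII conversion takes place.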

Answer 2

You are using Python 2.x, which tries to convert between unicode and plain str automatically, but that conversion often fails on non-ASCII characters.
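
The failure can be reproduced by hand. A minimal sketch, assuming the input is UTF-8: decoding the em dash's bytes as ASCII raises the same kind of error that Python 2 raises when it attempts this conversion behind the scenes.

```python
# The em dash u'\u2014' is three bytes in UTF-8: 0xe2 0x80 0x94.
raw = u'\u2014'.encode('utf-8')

try:
    # Roughly what Python 2 does when a byte str meets a unicode object.
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    message = str(exc)
else:
    message = None
```

The first byte, 0xe2, is outside the ASCII range, which is why it is the one named in the error from the question.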

You shouldn't mix unicode and str. You can either stick to unicode:

ignorelist = (u'!', u'-', u'_', u'(', u')', u',', u'.', u':', u';', u'"', u'\'', u'?', u'#', u'@', u'$', u'^', u'&', u'*', u'+', u'=', u'{', u'}', u'[', u']', u'\\', u'|', u'<', u'>', u'/', u'—')

if not isinstance(token, unicode):
    token = token.decode('utf-8') # assumes you are using UTF-8
for punc in ignorelist:
    token = token.replace(punc, u' ')

Or use only plain str (note the last item):

ignorelist = ('!', '-', '_', '(', ')', ',', '.', ':', ';', '"', '\'', '?', '#', '@', '$', '^', '&', '*', '+', '=', '{', '}', '[', ']', '\\', '|', '<', '>', '/', u'—'.encode('utf-8'))
# and other parts do not need to change

By manually encoding u'—' to a str, Python doesn't need to attempt the conversion on its own.
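
To illustrate what the byte-level approach does and where it can go wrong, here is a small sketch with made-up sample tokens. It assumes the data is UTF-8; the second half shows that the same text in another encoding (cp1252 is used purely as an example) is silently left untouched.

```python
# UTF-8 bytes of the em dash, used as the search pattern.
em_dash_utf8 = u'\u2014'.encode('utf-8')

# Token encoded as UTF-8: the byte pattern matches and is replaced.
utf8_token = u'a\u2014b'.encode('utf-8')
cleaned = utf8_token.replace(em_dash_utf8, b' ')

# Same text encoded as cp1252: the em dash is a single byte (0x97),
# so the UTF-8 pattern does not match and nothing is replaced.
cp1252_token = u'a\u2014b'.encode('cp1252')
unchanged = cp1252_token.replace(em_dash_utf8, b' ')
```

No error is raised in the second case, so a wrong encoding assumption would go unnoticed with this approach.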

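I recommend using unicode throughout your program to avoid this kind of error. If that is too much work, you can use the latter approach. Be careful, though, when calling functions in the standard library or third-party modules.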

# -*- coding: utf-8 -*- only tells Python that your source code is written in UTF-8 (without it you would get a SyntaxError).

Dealing with non-ASCII strings in Python
