Question

I have some code that works fine in Python 3:

def encode_test(filepath, char_to_int):
    with open(filepath, "r", encoding= "latin-1") as f:
        dat = [line.rstrip() for line in f]
        string_to_int = [[char_to_int[char] if char != 'ó' else char_to_int['ò'] for char in line] for line in dat]

However, when I try to run it in Python 2.7, I first get this error:

SyntaxError: Non-ASCII character '\xc3' in file languageIdentification.py on line 30, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

I then realized I probably needed to add #coding = utf-8 at the top of the file. After doing that, though, I ran into a different error:

UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
string_to_int = [[char_to_int[char] if char != 'ó' else char_to_int['ò'] for char in line] for line in dat]
Traceback (most recent call last):
File "languageIdentification.py", line 190, in <module>
test_string = encode_test(sys.argv[3], char_to_int)
File "languageIdentification.py", line 32, in encode_test
string_to_int = [[char_to_int[char] if char != 'ó' else 
char_to_int['ò'] for char in line] for line in dat]
KeyError: u'\xf3'

Can someone tell me how to fix this in Python 2.7?

Thanks!

Answer 1

The problem is that you are comparing a unicode string with a byte string:

char != 'ó'

where char is unicode and 'ó' is a byte string (a plain str).

When Python 2 encounters such a comparison, it tries to convert (decode) one side:

byte-string -> unicode

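The conversion uses the default encoding, which is ASCII in Python 2. Because the bytes of 'ó' have values above 127, the decoding fails; Python 2 then emits a UnicodeWarning and treats the two values as unequal.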

By the way, for literals whose byte values are all within the ASCII range, the comparison succeeds.
Example:

print u'ó' == 'ó' # UnicodeWarning: ...
print u'z' == 'z' # True
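The implicit conversion can be reproduced by hand. The following is only a sketch: implicit_ascii_decode is a hypothetical helper name, not something Python provides, and it mimics (rather than calls) what Python 2 does during a mixed comparison. It is written with explicit b'' and u'' literals so that it behaves the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: mimic the implicit byte-string -> unicode conversion by hand.

def implicit_ascii_decode(raw):
    """Hypothetical helper: decode bytes with ASCII, as Python 2 does implicitly."""
    try:
        return raw.decode('ascii')
    except UnicodeDecodeError:
        # Python 2 would emit a UnicodeWarning here and treat the operands as unequal.
        return None

utf8_o_acute = b'\xc3\xb3'  # the UTF-8 bytes of the character o-acute

ascii_result = implicit_ascii_decode(b'z')               # pure ASCII decodes fine
non_ascii_result = implicit_ascii_decode(utf8_o_acute)   # bytes above 127 fail
explicit_result = utf8_o_acute.decode('utf-8')           # explicit codec succeeds
```

Here ascii_result is u'z', non_ascii_result is None, and explicit_result is u'\xf3', which is the same character as u'ó'.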

So, for the comparison to work, you need to convert the byte string to unicode yourself. For example, you can use the built-in unicode() function:

u = unicode('ó', 'utf-8') # note, that you can specify encoding

or simply use a u'' literal:

u = u'ó'

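But be careful: with this option, the conversion uses the encoding declared at the top of the source file.
So the actual encoding of your source file and the encoding declared at the top must match.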

As I can see from the SyntaxError message, 'ó' in your source starts with the byte '\xc3'.
So it is most likely '\xc3\xb3', which is its UTF-8 encoding:

print '\xc3\xb3'.decode('utf-8') # ó

So # coding: utf-8 plus char != u'ó' should solve your problem.
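To make the byte values concrete, here is a small sketch of how the same character looks under the two encodings involved in this question. The variable names are illustrative only, and the snippet uses escapes instead of literal accented characters so that it does not depend on the source file's encoding.

```python
# -*- coding: utf-8 -*-
o_acute = u'\xf3'  # the character o-acute, code point U+00F3

as_utf8 = o_acute.encode('utf-8')      # two bytes, starting with 0xC3 as in the SyntaxError
as_latin1 = o_acute.encode('latin-1')  # one byte, 0xF3, the same number as the code point

# Decoding with the matching codec recovers the same character either way.
assert as_utf8.decode('utf-8') == as_latin1.decode('latin-1') == o_acute

# Decoding UTF-8 bytes as latin-1 raises no error but yields two wrong characters.
mismatched = as_utf8.decode('latin-1')
```

This is why a mismatch between the declared source encoding and the real one tends to go unnoticed until a comparison or lookup fails.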

UPD.

As I can see from the traceback that follows the UnicodeWarning, there is also a second problem: the KeyError.

This error is raised by the expression:

char_to_int[char]

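because u'\xf3' (that is, u'ó') is not a key in the dictionary. Note that the lookup only reaches u'ó' because the failed comparison was treated as unequal.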

This unicode value comes from decoding your file (with latin-1), and I suspect your char_to_int dict has no unicode keys at all.

So try encoding such a key back to its byte value:

char_to_int[char.encode('latin-1')]

To sum up, try changing the last line of your code to:

string_to_int = [[char_to_int[char.encode('latin-1')] if char != u'ó' else char_to_int['ò'] for char in line] for line in dat]
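If you control how char_to_int is built, another option is to keep everything as unicode and skip the per-character encode step. The sketch below assumes a dict with unicode keys (sample_map is a made-up stand-in, not your real mapping) and uses io.open, which decodes to unicode on both Python 2.7 and Python 3. It also returns the result, which the original function does not do.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

def encode_test(filepath, char_to_int):
    """Map every character of every line to an integer, treating o-acute as o-grave.

    Assumes char_to_int has unicode keys (an assumption, not the asker's dict).
    """
    # io.open returns unicode lines on both Python 2 and Python 3.
    with io.open(filepath, 'r', encoding='latin-1') as f:
        dat = [line.rstrip() for line in f]
    # u'\xf3' is o-acute, u'\xf2' is o-grave; both sides of the comparison are unicode.
    return [[char_to_int[u'\xf2' if char == u'\xf3' else char] for char in line]
            for line in dat]

# Usage with a throwaway file and a tiny hypothetical mapping.
sample_map = {u'a': 0, u'\xf2': 1}
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with io.open(path, 'w', encoding='latin-1') as f:
        f.write(u'a\xf3\n\xf2a\n')  # the first line contains o-acute
    result = encode_test(path, sample_map)
finally:
    os.remove(path)
```

With this sample input, result is [[0, 1], [1, 0]]. Whether this fits depends on where char_to_int comes from; if its keys are byte strings produced elsewhere, encoding the lookup key as shown above is the smaller change.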

Answer 2

If you want to convert characters to integer values, you can use the ord function, which also works for Unicode.

line = u'some Unicode line with ò and ó'
string_to_int = [ord(char) if char != u'ó' else ord(u'ò') for char in line]
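As a quick check of what ord produces here, the sketch below uses escapes so it does not depend on the source encoding; the sample line is made up for illustration. Keep in mind that ord returns Unicode code points, which may not match the indices stored in an existing char_to_int mapping.

```python
# -*- coding: utf-8 -*-
line = u'a \xf2 \xf3'  # 'a', space, o-grave, space, o-acute

# Replace o-acute (U+00F3) with o-grave (U+00F2) before taking the code point.
string_to_int = [ord(u'\xf2') if char == u'\xf3' else ord(char) for char in line]
```

For this line, string_to_int is [97, 32, 242, 32, 242].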

Encoding special characters in Python 2
