Question

I have a text sample:

"PROTECTING-ħarsien",

I am trying to parse it with the following code:

import csv, json

with open('./dict.txt') as maltese:
    entries = maltese.readlines()
    for entry in entries:
        tokens = entry.replace('"', '').replace(",", "").replace("\r\n", "").split("-")
        if len(tokens) == 1:
            pass
        else:   
            print tokens[0] + "," + unicode(tokens[1])

but I get this error message:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

What am I doing wrong?

Answer 1

It looks like dict.txt is encoded in UTF-8 (ħ is 0xc4 0xa7 in UTF-8).

You should open the file as UTF-8:

import codecs
with codecs.open('./dict.txt', encoding="utf-8") as maltese:
    # etc.

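You will then be working with Unicode strings rather than byte strings, so you don't need to call unicode() on them. You may, however, need to re-encode them to the encoding of the terminal you are printing to.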
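For reference, here is a minimal sketch of how the whole loop might look once the file is decoded on read. It uses io.open, which accepts an encoding argument on Python 2.6+ and is the same function as the built-in open on Python 3. The helper name parse_entries is hypothetical and not part of the original code, and the sketch assumes every useful line has the "ENGLISH-maltese", shape shown in the question.

```python
import io


def parse_entries(path):
    """Return (english, maltese) pairs from lines shaped like "WORD-word",.

    parse_entries is a hypothetical helper name used for illustration.
    """
    pairs = []
    # io.open decodes each line to a Unicode string as it is read,
    # so no unicode() call is needed later.
    with io.open(path, encoding="utf-8") as maltese:
        for entry in maltese:
            # strip() removes the trailing newline (and any stray "\r").
            tokens = entry.strip().replace('"', '').replace(',', '').split('-')
            if len(tokens) > 1:
                pairs.append((tokens[0], tokens[1]))
    return pairs
```

Each pair can then be joined with a comma and printed. On Python 2, if the terminal encoding cannot be detected, you would likely need to call .encode() with the terminal's encoding on the joined string before printing it.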

Answer 2

You have to change the last line to the following (tested against your data):

print tokens[0] + "," + unicode(tokens[1], 'utf8')

Without the 'utf8' argument, Python assumes the bytes are ASCII-encoded, hence the error.

See http://docs.python.org/2/howto/unicode.html#the-unicode-type
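The difference between the two decodings can be reproduced in isolation. The sketch below is only an illustration and assumes the Maltese word arrives as the UTF-8 byte string from the question. On Python 2, unicode(raw) with no encoding argument uses the default ASCII codec, which corresponds to the first decode call here.

```python
# UTF-8 bytes for "ħarsien": the letter ħ is the two bytes 0xc4 0xa7.
raw = b"\xc4\xa7arsien"

# Decoding as ASCII fails on the first byte, because 0xc4 is above 127.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Naming the real encoding succeeds and yields a Unicode string.
text = raw.decode("utf-8")
```

Here ascii_error holds the same kind of exception as the one in the question, raised at byte position 0, while text is the decoded word.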

How do I get Python to parse the following text?
