Question

I am reading a file in Python 2.7 and it works perfectly. The problem is that when I run the same program in Python 3.4, I get this error:

'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte'

Also, when I run the program on Windows (with Python 3.4), the error does not appear. The first line of the file is: Codi;Codi_lloc_anonim;Nom

The code of my program is:

def lectdict(filename,colkey,colvalue):
    f = open(filename,'r')
    D = dict()

    for line in f:
        if line == '\n': continue
        D[line.split(';')[colkey]] = D.get(line.split(';')[colkey],[]) + [line.split(';')[colvalue]]

    f.close()
    return D

Traduccio = lectdict('Noms_departaments_centres.txt',1,2)

Answer 1

In Python2,

f = open(filename,'r')
for line in f:

reads the lines from the file as bytes.

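In Python3, the same code reads strings from the file. A Python3 string is what Python2 calls a unicode object: bytes that have been decoded according to some encoding. When no encoding is passed, open() uses locale.getpreferredencoding(False), which is typically utf-8 on Linux and often a legacy code page such as cp1252 on Windows. That likely explains why the same program runs without error on Windows.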
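To see which encoding open() would pick on a given machine, a minimal check along these lines may help. The variable name default_enc is only illustrative.

```python
import locale

# Encoding that open() falls back to in text mode when no encoding= is given.
default_enc = locale.getpreferredencoding(False)
print(default_enc)
```

Running this on both the Linux and the Windows machine should show whether the two systems decode the file differently by default.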

The error message

'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte'

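shows that Python3 is trying to decode the bytes as utf-8. Since the decoding fails, the file evidently does not contain utf-8 encoded bytes.

To fix this, you need to specify the correct encoding of the file: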
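The same kind of failure can be reproduced without the file. The sketch below uses a made-up byte string (an assumption, not the asker's data) that contains the byte 0xf2 followed by an ordinary ASCII character, and inspects the resulting exception.

```python
# Hypothetical sample line; only the 0xf2 byte is taken from the error message.
raw = b'Codi;\xf2;Nom\n'

error = None
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    error = exc
    # exc.start is the offset of the offending byte within raw
    print(exc.encoding, hex(raw[exc.start]), exc.start)
```

In utf-8, 0xf2 can only begin a multi-byte sequence, so a file written in a single-byte encoding trips the decoder as soon as such a byte is followed by plain text.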

with open(filename, encoding=enc) as f:
    for line in f:

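If you don't know the correct encoding, you can run the following program to try every encoding known to Python. If you are lucky, one encoding will turn the bytes into recognizable characters. Sometimes more than one encoding appears to work, in which case you need to check and compare the results carefully.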

# Python3
import pkgutil
import os
import encodings

def all_encodings():
    modnames = set(
        [modname for importer, modname, ispkg in pkgutil.walk_packages(
            path=[os.path.dirname(encodings.__file__)], prefix='')])
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)

filename = '/tmp/test'
# use a separate name so the imported encodings module is not shadowed
candidates = all_encodings()
for enc in candidates:
    try:
        with open(filename, encoding=enc) as f:
            # print the encoding and the first 500 characters
            print(enc, f.read(500))
    except Exception:
        pass
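The caveat that several encodings may appear to work can be illustrated on the single byte from the error message. This is only a sketch: the three candidate encodings are picked for illustration, and which one is right depends on what the file is actually supposed to contain.

```python
raw = b'\xf2'  # the byte reported in the error message

# Decode the same byte with a few single-byte encodings and compare.
candidates = ('cp1250', 'cp1252', 'latin-1')
decoded = {enc: raw.decode(enc) for enc in candidates}

for enc in candidates:
    print(enc, repr(decoded[enc]))
```

All three decode without raising, but cp1250 yields 'ň' while cp1252 and latin-1 yield 'ò', so a successful decode alone does not prove that the encoding is the right one.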

Answer 2

OK, I did what @unutbu told me. The result was a lot of encodings, one of which was cp1250, so I changed:

f = open(filename,'r')

to

f = open(filename,'r', encoding='cp1250')

as @triplee suggested. Now I can read my file.
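Putting the fix into the function from the question could look roughly like this. It is a sketch, not the asker's final code: the encoding parameter and its cp1250 default are assumptions based on this answer, and the with block and setdefault are stylistic substitutions that should leave the returned values unchanged.

```python
def lectdict(filename, colkey, colvalue, encoding='cp1250'):
    """Map each value in column colkey to the list of values in column colvalue."""
    D = {}
    # The with block closes the file even if decoding fails part-way through.
    with open(filename, 'r', encoding=encoding) as f:
        for line in f:
            if line == '\n':
                continue  # skip empty lines, as in the original
            fields = line.split(';')
            D.setdefault(fields[colkey], []).append(fields[colvalue])
    return D
```

It would be called as in the question, for example lectdict('Noms_departaments_centres.txt', 1, 2), passing a different encoding if cp1250 turns out not to match the file.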

'utf-8' codec can't decode byte when reading a file in Python3.4, but not in Python2.7
