Encoding error when combining text files

Time: 2016-03-14 19:00:52

Tags: python text unicode utf-8

I am trying to run this code:

import glob
import io

read_files = filter(lambda f: f!='final.txt' and f!='result.txt', glob.glob('*.txt'))


with io.open("REGEXES.rx.txt", "w", encoding='UTF-32') as outfile:
    for f in read_files:
        with open(f, "r") as infile:
            outfile.write(infile.read())
            outfile.write('|')

to combine some text files, and I get this error:

Traceback (most recent call last):
  File "/Users/kosay.jabre/Desktop/Password Assessor/RegexesNEW/CombineFilesCopy.py", line 10, in <module>
    outfile.write(infile.read())
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 2189: ordinal not in range(128)

I have tried the UTF-8, UTF-16, UTF-32 and latin-1 encodings. Any ideas?

1 answer:

Answer 0: (score: 2)

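The error comes from infile.read(). The file is opened in text mode without an encoding specified, so Python falls back to your platform's default file encoding, which can be ASCII (as the ascii.py frame in your traceback shows). Any byte greater than \x7f (127) is not ASCII, so decoding it raises an error.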
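To see which default applies on a given machine, something like the following may help. It is only a sketch: try_decode is a hypothetical helper written for this illustration, and the printed default depends on your locale settings.

```python
import locale

# The encoding open() falls back to in text mode when encoding= is omitted.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)


def try_decode(raw, encoding):
    """Hypothetical helper: return the decoded text, or the error message."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: {}".format(exc)


# 0xa3 is the byte from the traceback. It is outside the ASCII range,
# but latin-1 maps every byte to a character, so that decode succeeds.
print(try_decode(b"\xa3", "ascii"))
print(try_decode(b"\xa3", "latin-1"))
```

Note that a successful latin-1 decode does not prove the file really is latin-1, because latin-1 never fails on any byte.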

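Before going any further, you need to know the encoding of your files. If Python reads a file with one encoding when it was actually written in another, you will either get an error or silently end up with mojibake.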
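Both failure modes can be reproduced in a few lines. This is a small illustration using a made-up sample string, not data from the question's files:

```python
# Sample text containing a non-ASCII character (the pound sign, U+00A3).
text = "\u00a3100"

# Case 1: UTF-8 bytes read as latin-1. No error, but the text is garbled.
utf8_bytes = text.encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")
print(repr(mojibake))

# Case 2: latin-1 bytes read as UTF-8. A lone 0xa3 byte is not valid
# UTF-8, so decoding fails instead of producing garbage.
latin1_bytes = text.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)
```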

Assuming infile is UTF-8 encoded, change:

with open(f, "r") as infile:

to:

with open(f, "r", encoding="utf-8") as infile:

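You may also want to change outfile's encoding to UTF-8 to avoid wasting storage, since UTF-32 spends four bytes on every character. Because the input is decoded to plain Unicode strings, the encodings of infile and outfile do not need to match.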
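Putting it together, the whole script could look roughly like this. Treat it as a sketch under assumptions: the inputs are assumed to be UTF-8, combine_text_files and its parameters are names made up for this example, and the sorting plus the exclusion of the output file itself are additions that the original script does not have.

```python
import glob
import io
import os


def combine_text_files(directory, out_name, in_encoding="utf-8",
                       out_encoding="utf-8",
                       skip=("final.txt", "result.txt"), separator="|"):
    """Concatenate the *.txt files in directory into out_name.

    Hypothetical helper for illustration. Returns the list of input
    paths that were combined.
    """
    # Sort so the output order does not depend on the filesystem.
    paths = sorted(glob.glob(os.path.join(directory, "*.txt")))
    # Leave out the skipped names and the output file itself, which
    # would otherwise match *.txt if it exists from an earlier run.
    sources = [p for p in paths
               if os.path.basename(p) not in skip
               and os.path.basename(p) != out_name]

    out_path = os.path.join(directory, out_name)
    with io.open(out_path, "w", encoding=out_encoding) as outfile:
        for path in sources:
            # An explicit encoding avoids the platform default.
            with io.open(path, "r", encoding=in_encoding) as infile:
                outfile.write(infile.read())
                outfile.write(separator)
    return sources
```

For the files in the question, the call would be something like combine_text_files(".", "REGEXES.rx.txt"). If the inputs turn out not to be UTF-8, pass the real encoding as in_encoding; the output encoding can stay UTF-8 either way.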