Question

First of all, I know there is a standard way to do the task described in the title. For example,

import csv
with open('test.txt', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

I ran this code on my data file (~262 MB) in a Jupyter terminal and got:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-21-cbed80c58499> in <module>()
      2 with open('CarRecord.txt', encoding='utf-8') as f:
      3     reader = csv.reader(f)
----> 4     for row in reader:
      5         print(row)

//anaconda/envs/py35/lib/python3.5/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 74: invalid start byte

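OK, position 74 is in the first line of my data file, right where the first Chinese character appears. Fair enough. So I ran another quick test: I copied the first few lines of the data file and pasted them into a new file. Running the same code on that test file worked, with no error message at all.

Does anyone have any ideas?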

------ Update based on suggestions in the comments: -------

import csv
with open('CarRecord.txt', mode='rb') as f:
    decoded_file = f.read().decode('utf-16')
    reader = csv.reader(decoded_file, delimiter=',')
    for row in reader:
        print(row)

Now I get:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-37-3708b52ef0a3> in <module>()
      1 import csv
      2 with open('CarRecord.txt', mode='rb') as f:
----> 3     decoded_file = f.read().decode('utf-16')
      4     reader = csv.reader(decoded_file, delimiter=',')
      5     for row in reader:

UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 1780-1781: illegal UTF-16 surrogate
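
A side note on the updated snippet, separate from the decoding error: `csv.reader` expects an iterable that yields lines, so passing it one big decoded string makes it treat every character as its own line. Once the text decodes correctly, wrapping it in `io.StringIO` is closer to the intent. A minimal sketch, where `rows_from_text` is a hypothetical helper name and not something from the original code:

```python
import csv
import io

def rows_from_text(text, delimiter=','):
    """Parse CSV rows from a string that has already been decoded."""
    # newline='' keeps StringIO from translating line endings,
    # which is what the csv module expects of its input.
    buffer = io.StringIO(text, newline='')
    return list(csv.reader(buffer, delimiter=delimiter))
```

For a 262 MB file this still holds the whole text in memory, so opening the file in text mode with the right encoding and passing the file object to `csv.reader` is probably the better fit.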

Answer 1

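This is not an exact answer to the question.

It turns out that the original data file, although it contains Unicode characters, was encoded in ASCII. So I saved a new copy of the data file encoded as UTF-8, and the standard way of reading a CSV file then worked fine.

Using Python 3.5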
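
The re-save step can also be done in Python rather than in an editor. Byte 0xa9 lies outside the 7-bit ASCII range, so the file was presumably in some legacy multi-byte encoding. The sketch below assumes the caller knows or guesses that encoding (for example `'gb18030'` for Chinese text, which is only an assumption here, not something stated above). `reencode_to_utf8` and `read_rows` are hypothetical helper names.

```python
import csv

def reencode_to_utf8(src_path, dst_path, src_encoding, chunk_size=1 << 20):
    """Copy a text file in chunks, decoding src_encoding and writing UTF-8."""
    # newline='' on both sides leaves line endings exactly as they were.
    with open(src_path, encoding=src_encoding, newline='') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        while True:
            chunk = src.read(chunk_size)  # chunk_size counts characters
            if not chunk:
                break
            dst.write(chunk)

def read_rows(path, encoding='utf-8'):
    """Yield CSV rows one at a time so a large file is never fully loaded."""
    with open(path, encoding=encoding, newline='') as f:
        yield from csv.reader(f)

# Example use (file names and source encoding are assumptions):
# reencode_to_utf8('CarRecord.txt', 'CarRecord_utf8.txt', 'gb18030')
# for row in read_rows('CarRecord_utf8.txt'):
#     print(row)
```

If the source encoding is known, passing it directly to `open(..., encoding=...)` avoids the extra copy altogether. If the guess is wrong, the decode step raises `UnicodeDecodeError` again instead of silently producing wrong text.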
