First of all, I know there is a standard way to do the task described in my title. For example:
import csv
with open('test.txt', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
I ran this code in Jupyter on my data file (~262 MB), and I got:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-21-cbed80c58499> in <module>()
2 with open('CarRecord.txt', encoding='utf-8') as f:
3 reader = csv.reader(f)
----> 4 for row in reader:
5 print(row)
//anaconda/envs/py35/lib/python3.5/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 74: invalid start byte
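OK, position 74 falls in the first line of my data file, right where the first Chinese character appears. So I ran another quick test: I copied the first few lines of the data file and pasted them into a new file. Running the same code against that test file works fine, with no error message.
Does anyone have any ideas?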
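One way to narrow this down is to look at the raw bytes near the failing offset and to check which encodings can decode the start of the file at all. The following is only a sketch: `peek_bytes` and `guess_encoding` are hypothetical helper names, and the candidate list (`utf-8`, `gb18030`, `big5`) is an assumption based on the file containing Chinese text. A successful decode does not prove the encoding is correct; it only rules out the ones that fail.

```python
import codecs


def peek_bytes(path, offset, width=8):
    """Return the raw bytes in a small window around `offset`."""
    with open(path, 'rb') as f:
        f.seek(max(offset - width, 0))
        return f.read(2 * width)


def guess_encoding(path, candidates=('utf-8', 'gb18030', 'big5'),
                   sample_size=1 << 20):
    """Return the first candidate that decodes the start of the file.

    The candidate list is an assumption, not something the file tells us.
    An incremental decoder is used so that a multi-byte character cut off
    at the end of the sample is not reported as an error.
    """
    with open(path, 'rb') as f:
        sample = f.read(sample_size)
    for name in candidates:
        decoder = codecs.getincrementaldecoder(name)()
        try:
            decoder.decode(sample, final=False)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

For example, `peek_bytes('CarRecord.txt', 74)` shows the bytes around the reported position, and `guess_encoding('CarRecord.txt')` returns a candidate name or `None`. It may also explain the copy-and-paste test: an editor will often save pasted text as UTF-8, so the new file would no longer carry the original file's encoding.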
------ Update based on suggestions in the comments: -------
import csv
with open('CarRecord.txt', mode='rb') as f:
    decoded_file = f.read().decode('utf-16')
reader = csv.reader(decoded_file, delimiter=',')
for row in reader:
    print(row)
Now I get:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-37-3708b52ef0a3> in <module>()
1 import csv
2 with open('CarRecord.txt', mode='rb') as f:
----> 3 decoded_file = f.read().decode('utf-16')
4 reader = csv.reader(decoded_file, delimiter=',')
5 for row in reader:
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 1780-1781: illegal UTF-16 surrogate
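The "illegal UTF-16 surrogate" error suggests the file is not UTF-16 either. Separately, even with the right encoding, passing a decoded string straight to `csv.reader` would not give the expected rows: `csv.reader` iterates over whatever it is given, and iterating a `str` yields single characters rather than lines. A minimal sketch of one way around that, wrapping the text in `io.StringIO` (the helper name `rows_from_bytes` is hypothetical, and the encoding has to be supplied by the caller):

```python
import csv
import io


def rows_from_bytes(raw, encoding):
    """Decode `raw` with the given encoding and parse it as CSV.

    io.StringIO makes the text iterable line by line, which is what
    csv.reader expects; newline='' leaves line endings for csv to handle.
    """
    text = raw.decode(encoding)
    return list(csv.reader(io.StringIO(text, newline='')))
```

This still reads the whole file into memory, which for a ~262 MB file may be acceptable but is not required; opening the file in text mode with the correct `encoding` streams it instead.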
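Answer 0 (score: 0)
This isn't exactly an answer to the question.
It turned out that the original data file, although it contains Unicode characters, was reported as ASCII-encoded. (A file containing the byte 0xa9 cannot be pure ASCII, so it was most likely in a legacy non-UTF-8 encoding.) So I saved a new copy of the data file encoded as UTF-8, and the standard way of reading a CSV file then worked fine.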
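The re-saving step can also be done in Python rather than in an editor. This is a sketch under one loud assumption: the source encoding defaults to `gb18030` here only because the file contains Chinese text, and the answer does not say what the real encoding was. The function names `reencode_to_utf8` and `read_rows` are hypothetical.

```python
import csv


def reencode_to_utf8(src_path, dst_path, src_encoding='gb18030',
                     chunk_chars=1 << 20):
    """Copy a text file to `dst_path`, re-encoding it as UTF-8.

    `src_encoding` is an assumption and must match the real file.
    The file is copied in chunks so a large file is never fully in memory;
    newline='' keeps the original line endings unchanged.
    """
    with open(src_path, encoding=src_encoding, newline='') as src, \
            open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)


def read_rows(path, encoding='utf-8'):
    """Yield CSV rows from a text file opened with the given encoding."""
    with open(path, encoding=encoding, newline='') as f:
        yield from csv.reader(f)
```

If the source encoding is known, the conversion can be skipped entirely by calling `read_rows(path, encoding=...)` on the original file, since `open` decodes on the fly.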