UnicodeDecodeError when reading a dictionary words file with a simple Python script

Time: 2009-06-19 10:35:22

Tags: python

This is my first time using Python in a while, and I can't do a simple scan of a file when I run the following script with Python 3.0.1:

with open("/usr/share/dict/words", 'r') as f:
    for line in f:
        pass

I get this exception:

Traceback (most recent call last):
  File "/home/matt/install/test.py", line 2, in <module>
    for line in f:
  File "/home/matt/install/root/lib/python3.0/io.py", line 1744, in __next__
    line = self.readline()
  File "/home/matt/install/root/lib/python3.0/io.py", line 1817, in readline
    while self._read_chunk():
  File "/home/matt/install/root/lib/python3.0/io.py", line 1565, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/home/matt/install/root/lib/python3.0/io.py", line 1299, in decode
    output = self.decoder.decode(input, final=final)
  File "/home/matt/install/root/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1689-1692: invalid data

The line in the file where it blows up is "Argentinian", which doesn't seem unusual in any way.

Update: I added

encoding="iso-8859-1"

to the open() call, and that fixed the problem.
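A minimal sketch of that fix, assuming the words file really is Latin-1 (ISO-8859-1) encoded, which the thread does not confirm. The helper name `count_lines` and the temporary sample file are hypothetical and exist only to make the example self-contained.

```python
import os
import tempfile

def count_lines(path, encoding="iso-8859-1"):
    """Scan a text file line by line, decoding it with the given encoding."""
    count = 0
    with open(path, "r", encoding=encoding) as f:
        for _ in f:
            count += 1
    return count

# Hypothetical sample data: two words, one containing the Latin-1 byte 0xE9,
# which is not valid UTF-8 in this position.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Argentinian\ncaf\xe9\n")
    sample_path = tmp.name

line_count = count_lines(sample_path)
os.remove(sample_path)
print(line_count)
```

Note that ISO-8859-1 maps every possible byte to a character, so decoding with it never raises. A clean run therefore shows only that the error is gone, not that the file was actually written in that encoding.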

2 answers:

Answer 0: (score: 1)

Can you check that the file is valid UTF-8? One way to do that, taken from another SO question:

iconv -f UTF-8 /usr/share/dict/words -o /dev/null

There are other ways to do the same thing.
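One of those other ways, sketched in Python rather than with iconv: read the raw bytes and attempt a strict UTF-8 decode. The function name `first_invalid_utf8_offset` and the sample byte strings are hypothetical, chosen for illustration.

```python
def first_invalid_utf8_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded.
        return exc.start
    return None

good = "café\n".encode("utf-8")
bad = b"Argentinian\ncaf\xe9\n"  # a lone 0xE9 (Latin-1 "é") is not valid UTF-8

print(first_invalid_utf8_offset(good))
print(first_invalid_utf8_offset(bad))
```

For a real file, the bytes would come from something like `open(path, 'rb').read()`, so the reported offset counts from the start of the file.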

Answer 1: (score: 1)

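How did you work out from "position 1689-1692" which line in the file it blew up on? Those numbers would be offsets into the chunk it was trying to decode, not into the file. You would have had to determine which chunk it was. How?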

Try this at the interactive prompt:

buf = open('the_file', 'rb').read()
len(buf)
ubuf = buf.decode('utf8')
# splat ... but it will give you the byte offset into the file
# set offset to the start position reported in that error message, then:
buf[offset-50:offset+60] # should show you where/what the problem is
# By the way, from the error message, looks like a bad
# FOUR-byte UTF-8 character ... interesting
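The same steps can be wrapped in a small function that takes the offset directly from the exception instead of copying it by hand. This is only a sketch: the name `show_decode_problem`, the `context` parameter, and the sample bytes are hypothetical. Because the whole buffer is decoded in one call, the offset counts from the start of the data rather than from an internal read chunk, so it can be turned into a line number by counting newlines.

```python
def show_decode_problem(buf, context=50):
    """Return (offset, line_number, surrounding_bytes) for the first bad byte, or None."""
    try:
        buf.decode("utf-8")
    except UnicodeDecodeError as exc:
        offset = exc.start
        # Lines are 1-based: count the newlines that occur before the bad byte.
        line_number = buf.count(b"\n", 0, offset) + 1
        # Clamp the start so a small offset does not turn into a negative index.
        snippet = buf[max(0, offset - context):offset + context]
        return offset, line_number, snippet
    return None

# Hypothetical sample data with one Latin-1 byte (0xE9) on the third line.
sample = b"Argentina\nArgentinian\nAr\xe9valo\n"
result = show_decode_problem(sample, context=5)
print(result)
```

With a real file, `buf` would be `open('the_file', 'rb').read()` as in the session above.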