Question

Calling tell() while reading my GBK-encoded file causes the next call to readline() to raise a UnicodeDecodeError. If I don't call tell(), the error is not raised.

C:\tmp> hexdump badtell.txt

000000: 61 20 6B 0D 0A D2 BB B0-E3                       a k......

C:\tmp> type test.py

with open(r'c:\tmp\badtell.txt', "r", encoding='gbk') as f:
    while True:
        pos = f.tell()
        line = f.readline();
        if not line: break
        print(line)

C:\tmp> python test.py

a k

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    line = f.readline();
UnicodeDecodeError: 'gbk' codec can't decode byte 0xd2 in position 0:  incomplete multibyte sequence

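When I remove the f.tell() statement, the file decodes successfully. Why? I tried Python 3.4 and 3.5 x64 on Win7 and Win10, and the result is the same everywhere.

Anyone, any ideas? Should I report a bug?

I have a big text file, and I really want to get file position ranges within it. Is there a workaround?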

Answer 1

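I just reproduced this on Python 3.4 x64 on Linux. Looking at the docs for TextIOBase, I don't see anything that says tell() causes problems with reading a file, so maybe it is indeed a bug.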

b'\xd2'.decode('gbk')

gives the same error that you saw, but in your file that byte is followed by the byte BB, and

b'\xd2\xbb'.decode('gbk')

gives a value equal to '\u4e00', not an error.
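To make the difference between a lone lead byte and a complete two-byte sequence concrete, here is a minimal sketch using the standard codecs module. It only illustrates how a GBK decoder treats a partial sequence; it makes no claim about what tell() does internally.

```python
import codecs

# An incremental decoder buffers an incomplete multibyte sequence
# instead of failing, as long as more input may still arrive.
decoder = codecs.getincrementaldecoder('gbk')()
first = decoder.decode(b'\xd2')    # lead byte only: nothing is emitted yet
second = decoder.decode(b'\xbb')   # trail byte completes the character

# Decoding the lone lead byte in one shot cannot wait for more input.
try:
    b'\xd2'.decode('gbk')
    one_shot_error = None
except UnicodeDecodeError as exc:
    one_shot_error = exc.reason

print(repr(first), repr(second), one_shot_error)
```

The error in the question is the one-shot kind: the decoder was asked to finish with only 0xd2 in hand, even though BB follows it in the file.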

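I found a workaround that works for the data in the original question, but not for other data, as you have since found. I wish I knew why! I called seek() after every tell(), passing it the value that tell() returned: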

pos = f.tell()
f.seek(pos)
line = f.readline()

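An alternative to f.seek(f.tell()) is to use the SEEK_CUR mode of seek() to get the position. With an offset of 0, this does the same as the code above: it moves to the current position and returns that position.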

import io  # needed for io.SEEK_CUR

pos = f.seek(0, io.SEEK_CUR)
line = f.readline()
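As a quick check that a zero-offset SEEK_CUR seek reports the same position as tell(), here is a small sketch on an in-memory text stream. The sample data is UTF-8 and made up for illustration, so it does not exercise the GBK problem itself.

```python
import io

# UTF-8 sample data in memory; a hypothetical stand-in for a real file.
raw = io.BytesIO("a k\r\nxyz\r\n".encode("utf-8"))
f = io.TextIOWrapper(raw, encoding="utf-8", newline="")

first_line = f.readline()
pos_tell = f.tell()                 # position reported by tell()
pos_seek = f.seek(0, io.SEEK_CUR)   # a zero-offset seek returns the position too
next_line = f.readline()            # reading continues from the same place

print(pos_tell == pos_seek, repr(next_line))
```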

Answer 2

OK, there is a workaround, and it has worked so far:

with open(r'c:\tmp\badtell.txt', "rb") as f:
    while True:
        pos = f.tell()
        line = f.readline();
        if not line: break
        line = line.decode("gbk").rstrip('\r\n')  # binary mode keeps the '\r' as well
        print(line)
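Since the goal in the question is to record file position ranges, the binary-mode approach could be wrapped in a generator along these lines. This is only a sketch: iter_line_spans is a hypothetical helper name, and the sample bytes are the nine bytes from the hexdump in the question, held in memory rather than read from disk.

```python
import io

def iter_line_spans(binary_file, encoding="gbk"):
    """Yield (start, end, text) for each line of a file opened in binary mode.

    Hypothetical helper, not part of the original answer.
    """
    while True:
        start = binary_file.tell()       # plain byte offset, no decoder involved
        raw = binary_file.readline()
        if not raw:
            break
        end = binary_file.tell()
        yield start, end, raw.decode(encoding).rstrip("\r\n")

# The nine bytes shown in the question's hexdump.
sample = io.BytesIO(bytes.fromhex("61206B0D0AD2BBB0E3"))
spans = list(iter_line_spans(sample))
print(spans)
```

Splitting on the newline byte before decoding is safe for GBK, because its trail bytes start at 0x40 and so 0x0A never occurs inside a multibyte character.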

I filed an issue here yesterday: http://bugs.python.org/issue26990

No response yet.

Why does file.tell() affect encoding?
