Question

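In Python 3, read(size) is documented as follows:

Read and return at most size characters from the stream as a single str. If size is negative or None, reads until EOF.

But suppose you seek() to the middle of a multi-byte UTF-8 character. What will read(1) return?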

Answer 1

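A partial Unicode character cannot be decoded, so Python raises UnicodeDecodeError. You can recover from the problem, though. UTF-8 is designed to be self-synchronizing: the first byte of an encoded character (0x00-0x7f for ASCII, or 0xc2-0xf4 for the lead byte of a multi-byte sequence) never appears as a continuation byte (0x80-0xbf), so you only need to step back one byte at a time until decoding works.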

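>>> def read_unicode(fp, position, count):
...     # Step back one byte at a time until the read starts on a character boundary.
...     while position >= 0:
...         fp.seek(position)
...         try:
...             return fp.read(count)
...         except UnicodeDecodeError:
...             position -= 1
...     # No character boundary was found before the start of the file.
...     raise UnicodeDecodeError("utf-8", b"", 0, 1, "File not decodable")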
... 
>>> open('test.txt', 'w', encoding='utf-8').write("学"*10000)
10000
>>> f=open('test.txt', 'r', encoding='utf-8')
>>> f.seek(32)
32
>>> f.read(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa6 in position 0: invalid start byte
>>> read_unicode(f, 32, 1)
'学'
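
An alternative to retrying the decode is to open the file in binary mode and inspect the byte patterns directly. The following is a minimal sketch and not part of the answer above: read_char_at and is_continuation are hypothetical helper names, and the code assumes a seekable binary stream that contains valid UTF-8.

```python
import io


def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx (0x80-0xBF).
    return (byte & 0xC0) == 0x80


def read_char_at(fp, position: int) -> str:
    """Return the character whose encoding contains the byte at `position`.

    Hypothetical helper: `fp` is assumed to be a seekable *binary* stream
    holding valid UTF-8.
    """
    # A UTF-8 character is at most 4 bytes long, so 3 steps back are enough.
    for _ in range(3):
        fp.seek(position)
        byte = fp.read(1)
        if position == 0 or not byte or not is_continuation(byte[0]):
            break
        position -= 1

    fp.seek(position)
    lead = fp.read(1)
    if not lead:
        return ""  # position is at or past EOF

    # The high bits of the lead byte give the length of the sequence.
    first = lead[0]
    if first < 0x80:
        length = 1
    elif first >> 5 == 0b110:
        length = 2
    elif first >> 4 == 0b1110:
        length = 3
    elif first >> 3 == 0b11110:
        length = 4
    else:
        raise ValueError(f"no UTF-8 lead byte found near offset {position}")
    return (lead + fp.read(length - 1)).decode("utf-8")


# Usage: byte offset 3 is the last byte of the three-byte character "学".
data = io.BytesIO("a学b".encode("utf-8"))
print(read_char_at(data, 3))
```

Working on bytes keeps every seek() well defined, because binary streams accept arbitrary byte offsets; decoding happens only once the character boundary is known.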

Answer 2

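Text streams in Python 3 do not support arbitrary seek offsets. You are only supposed to use an offset of 0, or a value returned by tell(), with whence SEEK_SET. Anything else is undefined or unsupported behavior. See the docs for TextIOBase.seek.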

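Of course, in practice you will probably get a UnicodeDecodeError, but that is not guaranteed. Once you violate the API contract, the implementation is free to do whatever it wants.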
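
To stay within the contract, a text stream can be positioned only at offsets that tell() handed out earlier. The sketch below illustrates that pattern under some assumptions: char_offsets is a hypothetical helper name, and an in-memory TextIOWrapper stands in for a real UTF-8 file.

```python
import io


def char_offsets(stream):
    """Collect the opaque tell() value that precedes each character.

    Hypothetical helper: `stream` is assumed to be a seekable text stream.
    """
    offsets = []
    stream.seek(0)
    while True:
        cookie = stream.tell()  # opaque position, only meaningful to seek()
        if not stream.read(1):
            break  # reached EOF
        offsets.append(cookie)
    return offsets


# An in-memory text stream standing in for a UTF-8 encoded file.
text = io.TextIOWrapper(io.BytesIO("a学b".encode("utf-8")), encoding="utf-8")

offsets = char_offsets(text)
text.seek(offsets[1])  # supported: the value came from tell()
print(text.read(1))
```

Calling tell() once per character is slow on large files, so this is meant to show the supported usage rather than an efficient indexing scheme.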
