Question

I tried:

with zipfile.ZipFile("5.csv.zip", "r") as zfile:
    for name in zfile.namelist():
        with zfile.open(name, 'rU') as readFile:
                line = readFile.readline()
                print(line)
                split = line.split('\t')

Result:

b'$0.0\t1822\t1\t1\t1\n'
Traceback (most recent call last)
File "zip.py", line 6
    split = line.split('\t')
TypeError: Type str doesn't support the buffer API

How can I open the text file as unicode (str) rather than as bytes (b'...')?

Answer 1

To convert a byte stream into a Unicode stream, you can use io.TextIOWrapper():

import io
import zipfile

encoding = 'utf-8'
with zipfile.ZipFile("5.csv.zip") as zfile:
    for name in zfile.namelist():
        with zfile.open(name) as readfile:
            # Wrap the binary member so that iteration yields decoded str lines
            for line in io.TextIOWrapper(readfile, encoding):
                print(repr(line))

Note: TextIOWrapper() uses universal newlines mode by default. The 'rU' mode of zfile.open() has been deprecated since version 3.4 and was removed in 3.6.

This avoids the multibyte encoding problems described in @Peter DeGlopper's answer.
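
As a self-contained check of the approach above, here is a minimal sketch that builds a small archive in memory and reads it back as text. The member name 5.csv and its content are assumptions made for illustration, modeled on the line shown in the question. Because each line is a str, split('\t') works without a TypeError.

```python
import io
import zipfile

# Build a small archive in memory. The member name and content are
# illustrative assumptions, modeled on the line shown in the question.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zfile:
    zfile.writestr("5.csv", "$0.0\t1822\t1\t1\t1\n".encode("utf-8"))
buffer.seek(0)

rows = []
with zipfile.ZipFile(buffer) as zfile:
    for name in zfile.namelist():
        with zfile.open(name) as readfile:
            # TextIOWrapper decodes the bytes, so each line is a str
            for line in io.TextIOWrapper(readfile, encoding="utf-8"):
                rows.append(line.rstrip("\n").split("\t"))

print(rows)
```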

Answer 2

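Edit: For Python 3, using io.TextIOWrapper as J. F. Sebastian describes is the best choice. The answer below may still be helpful for 2.x. I don't think anything below is actually incorrect even for 3.x, but io.TextIOWrapper is still better.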

If the file is utf-8, this will work:

# the rest of the code as above, then:
with zfile.open(name, 'rU') as readFile:
    line = readFile.readline().decode('utf8')
    # etc

If you are going to iterate over the file, you can use codecs.iterdecode, but that does not work with readline().

import codecs

with zfile.open(name, 'rU') as readFile:
    for line in codecs.iterdecode(readFile, 'utf8'):
        print(line)
        # etc

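Note that neither of these approaches is necessarily safe for multibyte encodings. For example, little-endian UTF-16 represents the newline character with the bytes b'\x0A\x00'. A non-unicode-aware tool that looks for newlines will split that incorrectly, leaving a null byte at the start of the following line. In such a case you have to use something that does not try to split the input by newlines, such as ZipFile.read, and then decode the whole byte string at once. This is not a concern for utf-8.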
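
The UTF-16 pitfall can be reproduced with a short sketch. The member name data.txt and the sample text are assumptions chosen for illustration. The archive is built in memory, read in one piece with ZipFile.read, and then split in two ways for comparison.

```python
import io
import zipfile

text = "a\tb\nc\td\n"  # illustrative content, not from the original question

# Store the text as little-endian UTF-16 in an in-memory archive
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zfile:
    zfile.writestr("data.txt", text.encode("utf-16-le"))
buffer.seek(0)

with zipfile.ZipFile(buffer) as zfile:
    raw = zfile.read("data.txt")  # the whole member as one bytes object

# Byte-level splitting cuts each two-byte newline (b'\x0a\x00') in half,
# so the next piece starts with the stray b'\x00'
naive = raw.split(b"\n")

# Decoding the complete byte string first, then splitting, is safe
decoded_lines = raw.decode("utf-16-le").splitlines()

print(naive)
print(decoded_lines)
```

Reading the whole member at once costs memory for large files; io.TextIOWrapper from the first answer decodes incrementally instead.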

Answer 3

You are seeing that error because you are trying to mix bytes with unicode. The argument to split must also be a byte string:

>>> line = b'$0.0\t1822\t1\t1\t1\n'
>>> line.split(b'\t')
[b'$0.0', b'1822', b'1', b'1', b'1\n']

To get a unicode string, use decode:

>>> line.decode('utf-8')
'$0.0\t1822\t1\t1\t1\n'
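
Putting the two steps together, a minimal Python 3 sketch (using the line from the question, and assuming the data is utf-8) decodes first and then splits on a str separator. The variable names are chosen for illustration.

```python
line = b'$0.0\t1822\t1\t1\t1\n'

# Mixing bytes with a str separator raises TypeError on Python 3
try:
    line.split('\t')
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decode to str first, strip the trailing newline, then split on a str tab
fields = line.decode('utf-8').rstrip('\n').split('\t')

print(mixed_ok)
print(fields)
```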

How do I open a unicode text file inside a zip?
