Question

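Is there a way to stream-decompress a single-file zip archive?

I currently have arbitrarily large zipped archives (a single file per archive) in S3. I would like to process the files by iterating over them, without actually downloading them to disk or into memory.

A simple example: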

import boto

def count_newlines(bucket_name, key_name):
    conn = boto.connect_s3()
    b = conn.get_bucket(bucket_name)
    # key is a .zip file
    key = b.get_key(key_name)

    count = 0
    for chunk in key:
        # How should decompress happen?
        count += decompress(chunk).count('\n')

    return count

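This answer demonstrates a way to do the same thing with gzip'd files. Unfortunately, I haven't been able to get the same technique to work with the zipfile module, because it seems to require random access to the entire file being unzipped.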
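For context, one common way to stream-decompress gzip data uses zlib.decompressobj; this is a minimal sketch and may not match the linked answer exactly. The function name stream_gunzip is made up for illustration, and chunks stands for any iterable of byte strings, such as the boto key above.

```python
import zlib


def stream_gunzip(chunks):
    """Yield decompressed bytes from an iterable of gzip-compressed chunks."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        out = decomp.decompress(chunk)
        if out:
            yield out
    # Emit anything still buffered inside the decompressor.
    tail = decomp.flush()
    if tail:
        yield tail
```

Counting newlines then becomes sum(part.count(b'\n') for part in stream_gunzip(key)), with only one chunk held in memory at a time.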

Answer 1

You can use https://pypi.python.org/pypi/tubing; it even has built-in S3 source support using boto3.

from tubing.ext import s3
from tubing import pipes, sinks
output = s3.S3Source(bucket, key) \
    | pipes.Gunzip() \
    | pipes.Split(on=b'\n') \
    | sinks.Objects()
print(len(output))

If you don't want to store the entire output in the returned sink, you can write your own sink that just counts. The implementation would look like:

class CountWriter(object):
    def __init__(self):
        self.count = 0
    def write(self, chunk):
        self.count += len(chunk)
Counter = sinks.MakeSink(CountWriter)

Answer 2

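The zip header (the central directory) is at the end of the file, which is why it needs random access. See https://en.wikipedia.org/wiki/Zip_(file_format)#Structure.

For a simple zip, you could parse the local file header, which should be at the start of the file, and decompress the bytes with zlib (see zipfile.py). This is not a valid way to read a zip file: it might work for your specific scenario, but it could also fail on a lot of valid zips. Reading the central directory file header is the only correct way to read a zip.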
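A rough sketch of that local-file-header approach, with the same caveats: it assumes the archive starts with a local file header and that the single member is deflate-compressed. The name iter_first_member is hypothetical, and the sketch does no CRC check and ignores encryption and the central directory entirely.

```python
import struct
import zlib

# Fixed 30-byte part of a zip local file header (little-endian).
LOCAL_HEADER = struct.Struct('<4sHHHHHIIIHH')


def iter_first_member(chunks):
    """Yield the decompressed bytes of the first member of a zip stream."""
    chunks = iter(chunks)
    buf = b''

    def fill(n):
        # Pull more input until at least n bytes are buffered.
        nonlocal buf
        while len(buf) < n:
            chunk = next(chunks, None)
            if chunk is None:
                raise ValueError('truncated zip stream')
            buf += chunk

    fill(LOCAL_HEADER.size)
    (sig, _version, _flags, method, _time, _date,
     _crc, _csize, _usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(buf)
    if sig != b'PK\x03\x04':
        raise ValueError('stream does not start with a local file header')
    if method != 8:
        raise ValueError('this sketch only handles deflate (method 8)')

    # Skip the variable-length file name and extra field.
    data_start = LOCAL_HEADER.size + name_len + extra_len
    fill(data_start)
    pending = buf[data_start:]

    # Negative wbits selects a raw deflate stream (no zlib/gzip wrapper).
    decomp = zlib.decompressobj(-zlib.MAX_WBITS)
    while True:
        if pending:
            out = decomp.decompress(pending)
            if out:
                yield out
        if decomp.eof:
            return  # end of the deflate stream; the rest of the zip is ignored
        pending = next(chunks, None)
        if pending is None:
            raise ValueError('truncated deflate stream')
```

Because a deflate stream marks its own end, the sketch never reads the size fields in the header, so it should also cope with archives that store the sizes in a trailing data descriptor. Stored (uncompressed) members are a different matter and are rejected here.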

Answer 3

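Yes, but if it has to be in Python, you will likely have to write your own code to do it. You can look at sunzip for an example, in C, of how to unzip a zip file from a stream. sunzip creates temporary files as it decompresses the zip entries, then moves those files and sets their attributes once it reads the central directory at the end. Claims that you must be able to seek to the central directory in order to properly unzip a zip file are incorrect.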

Answer 4

You can do the following with ZipFile in Python 3.4.3:

from zipfile import ZipFile

with ZipFile('spam.zip') as myzip:
    with myzip.open('eggs.txt') as myfile:
        print(myfile.read())

Python Docs

Answer 5

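While I suspect it is not possible with absolutely every zip file, I also suspect that almost(?) all modern zip files are stream-compatible and can be stream-decompressed, for example with https://github.com/uktrade/stream-unzip [full disclosure: originally written by me]

The example from its README shows how to do this for an arbitrary HTTP request using httpx: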

from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    # Any iterable that yields a zip file
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes()

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)

But I think it could be adapted to stream-unzip from S3 with boto3 (untested):

from stream_unzip import stream_unzip
import boto3

def zipped_chunks():
    yield from boto3.client('s3', region_name='us-east-1').get_object(
        Bucket='my-bucket-name',
        Key='the/key/of/the.zip'
    )['Body'].iter_chunks()

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)

Stream-decompressing a zip archive in Python

5 answers: