Question

I can generate and stream text on the fly, but I am unable to generate and stream a compressed file on the fly.

from flask import Flask, request, Response,stream_with_context
import zlib
import gzip

app = Flask(__name__)

def generate_text():
    for x in xrange(10000):
        yield "this is my line: {}\n".format(x)

@app.route('/stream_text')
def stream_text():
    response = Response(stream_with_context(generate_text()))
    return response

def generate_zip():
    for x in xrange(10000):
        yield zlib.compress("this is my line: {}\n".format(x))

@app.route('/stream_zip')
def stream_zip():
    response = Response(stream_with_context(generate_zip()), mimetype='application/zip')
    response.headers['Content-Disposition'] = 'attachment; filename=data.gz'
    return response

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000, debug=True)

Then, using curl and gunzip:

curl http://127.0.0.1:8000/stream_zip > data.gz

gunzip data.gz
gunzip: data.gz: not in gzip format

I don't care whether it is zip, gzip, or any other type of compression.

In my real code, generate_text produces more than 4 GB of data, so I would like to compress it on the fly.

Saving the text to a file, zipping it, returning the zip file, and then deleting it is not the solution I'm after.

I need to stay in a loop: generate some text -> compress that text -> stream the compressed data, until I'm done.

zip, gzip, ... anything is fine as long as it works.

Answer 1

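You are yielding a series of compressed documents, not a single compressed stream. Don't use zlib.compress() here: each call adds its own header and produces a complete, standalone document.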

Instead, create a zlib.compressobj() object and use the Compress.compress() method on that object to produce a single stream of data (followed by a final call to Compress.flush()):

def generate_zip():
    compressor = zlib.compressobj()
    for x in xrange(10000):
        chunk = compressor.compress("this is my line: {}\n".format(x))
        if chunk:
            yield chunk
    yield compressor.flush()

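The compressor can return empty chunks when there is not yet enough data to produce a full block of compressed data, so the code above only yields when there is actually something to send. Because your input data is so highly repetitive, it compresses very efficiently, and this yields only 3 times: once with the 2-byte header, once with about 21 kB of compressed data covering the first 8288 iterations over xrange(), and finally with the remaining 4 kB for the rest of the loop.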

In aggregate, this produces the same data as a single zlib.compress() call with all inputs concatenated. The correct MIME type for this data format is application/zlib, not application/zip.
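To make the difference between the two approaches concrete, here is a minimal sketch, assuming Python 3 (where the compressor takes bytes rather than str). The helper names `lines` and `stream_zlib` are made up for this illustration and are not part of the code above.

```python
import zlib

def lines(n=10000):
    """Yield the sample lines from the question, encoded to bytes."""
    for x in range(n):
        yield "this is my line: {}\n".format(x).encode("ascii")

def stream_zlib(chunks):
    """Compress an iterable of bytes into one continuous zlib stream."""
    compressor = zlib.compressobj()
    for data in chunks:
        out = compressor.compress(data)
        if out:  # the compressor may still be buffering; skip empty results
            yield out
    yield compressor.flush()

# One continuous stream: a single decompress call restores every line.
streamed = b"".join(stream_zlib(lines()))
restored = zlib.decompress(streamed)

# The question's approach: every line becomes its own complete document.
broken = b"".join(zlib.compress(line) for line in lines(3))
d = zlib.decompressobj()
first = d.decompress(broken)
# `first` holds only the first line; the remaining documents are left
# untouched in d.unused_data because the first stream has already ended.
leftover = d.unused_data
```

A decompressor stops at the end of the first document, which is why a client reading the concatenated output never sees more than one line.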

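This format cannot readily be decompressed with gzip, however, at least not without some trickery. That's because the code above does not yet produce a GZIP file; it only produces a raw zlib-compressed stream. To make it GZIP compatible, you need to configure the compression correctly, send a header first, and append a CRC checksum and the data length at the end: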

import zlib
import struct
import time

def generate_gzip():
    # Yield a gzip file header first.
    yield (
        '\037\213\010\000' + # Gzip file, deflate, no filename
        struct.pack('<L', long(time.time())) +  # compression start time
        '\002\377'  # maximum compression, no OS specified
    )

    # bookkeeping: the compression state, running CRC and total length
    compressor = zlib.compressobj(
        9, zlib.DEFLATED, -zlib.MAX_WBITS, zlib.DEF_MEM_LEVEL, 0)
    crc = zlib.crc32("")
    length = 0

    for x in xrange(10000):
        data = "this is my line: {}\n".format(x)
        chunk = compressor.compress(data)
        if chunk:
            yield chunk
        crc = zlib.crc32(data, crc) & 0xffffffffL
        length += len(data)

    # Finishing off, send remainder of the compressed data, and CRC and length
    yield compressor.flush()
    yield struct.pack("<2L", crc, length & 0xffffffffL)

Serve this as application/gzip:

@app.route('/stream_gzip')
def stream_gzip():
    response = Response(stream_with_context(generate_gzip()), mimetype='application/gzip')
    response.headers['Content-Disposition'] = 'attachment; filename=data.gz'
    return response

The result can then be decompressed on the fly:

curl http://127.0.0.1:8000/stream_gzip | gunzip -c | less
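The generator above is written for Python 2 (xrange, long, and the 0xffffffffL literal). As a hedged alternative for Python 3, zlib can be asked to write the gzip header and the CRC/length trailer itself by passing a wbits value of zlib.MAX_WBITS | 16. This is a different technique from the hand-built header above, not a line-for-line port, and the names `generate_gzip_py3` and `sample_lines` are invented for this sketch.

```python
import gzip
import zlib

def generate_gzip_py3(chunks, level=9):
    """Yield one gzip stream for an iterable of bytes chunks."""
    # wbits = MAX_WBITS | 16 tells zlib to wrap the deflate data in a basic
    # gzip header and a trailing checksum instead of the zlib wrapper.
    compressor = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS | 16)
    for data in chunks:
        out = compressor.compress(data)
        if out:
            yield out
    yield compressor.flush()

def sample_lines(n=10000):
    """Sample input: the lines from the question, encoded to bytes."""
    for x in range(n):
        yield "this is my line: {}\n".format(x).encode("ascii")

# Join the streamed chunks and check that a gzip reader accepts them.
payload = b"".join(generate_gzip_py3(sample_lines()))
roundtrip = gzip.decompress(payload)
```

In a Flask view this generator would be passed to Response(...) with mimetype='application/gzip', in the same way as stream_gzip() above.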

Answer 2

While I was very impressed by Martijn's solution, I decided to roll my own that uses pigz for better performance:

def yield_pigz(results, compresslevel=1):
    cmd = ['pigz', '-%d' % compresslevel]
    pigz_proc = subprocess.Popen(cmd, bufsize=0,
        stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def f():
        for result in results:
            pigz_proc.stdin.write(result)
            pigz_proc.stdin.flush()
        pigz_proc.stdin.close()
    try:
        t = threading.Thread(target=f)
        t.start()
        while True:
            buf = pigz_proc.stdout.read(4096)
            if len(buf) == 0:
                break
            yield buf
    finally:
        t.join()
        pigz_proc.wait()

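Keep in mind that you need to import subprocess and threading for this to work. You also need to install the pigz program (it is already in the repositories of most Linux distributions; on Ubuntu, just run sudo apt install pigz -y).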

Example usage:

from flask import Flask, Response
import subprocess
import threading
import random

app = Flask(__name__)

def yield_something_random():
    for i in range(10000):
        seq = [chr(random.randint(ord('A'), ord('Z'))) for c in range(1000)]
        # Encode to bytes: the pigz stdin pipe is binary (required on Python 3)
        yield ''.join(seq).encode('ascii')

@app.route('/')
def index():
    return Response(yield_pigz(yield_something_random()))
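Because this approach depends on an external binary, it may be worth checking that pigz is actually installed before spawning it. The following is only a sketch: `pick_compressor_cmd` is a hypothetical helper, and falling back to the plain gzip command is an assumption about what is acceptable in your deployment.

```python
import shutil

def pick_compressor_cmd(compresslevel=1):
    """Return an argv list for a stdin-to-stdout gzip compressor, or None."""
    level_flag = '-%d' % compresslevel
    if shutil.which('pigz'):
        # pigz compresses stdin to stdout when no file names are given
        return ['pigz', level_flag]
    if shutil.which('gzip'):
        # assumed fallback: single-threaded gzip, writing to stdout
        return ['gzip', level_flag, '-c']
    return None

cmd = pick_compressor_cmd()
```

If `cmd` is None, neither program was found on PATH, and a pure-Python compressor such as zlib would be needed instead.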

Answer 3

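I think that at the moment you are just sending the generator instead of the data! You probably want to do something like this (I haven't tested it, so it may need some changes):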

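def generate_zip():
    import io
    # Keep a reference to the buffer: a GzipFile opened for writing cannot
    # be read back, so the compressed bytes must come from the BytesIO.
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode='w') as gfile:
        for x in xrange(10000):
            gfile.write("this is my line: {}\n".format(x))
    # Note: this builds the whole archive in memory before returning it.
    return buf.getvalue()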

Answer 4

A working generate_zip() with low memory consumption :)

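import gzip
import io

def generate_zip():
    buff = io.BytesIO()
    gz = gzip.GzipFile(mode='w', fileobj=buff)
    for x in xrange(10000):
        gz.write("this is my line: {}\n".format(x))
        # getvalue() returns everything written so far; read() would return
        # nothing here because the position is already at the end.
        data = buff.getvalue()
        if data:
            yield data
            # Rewind before truncating so the buffer is really emptied.
            buff.seek(0)
            buff.truncate()
    gz.close()
    yield buff.getvalue()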
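For readers on Python 3, the same drain-the-buffer idea can be sketched as follows. It assumes the input is an iterable of bytes; the names `generate_gzip_chunks` and `demo_lines` are chosen for this example only.

```python
import gzip
import io

def generate_gzip_chunks(lines):
    """Yield gzip data piece by piece while GzipFile writes into a buffer."""
    buff = io.BytesIO()
    with gzip.GzipFile(mode='wb', fileobj=buff) as gz:
        for line in lines:
            gz.write(line)
            data = buff.getvalue()
            if data:
                yield data
                # Empty the buffer so already-sent bytes are not kept around.
                buff.seek(0)
                buff.truncate()
    # Leaving the with block closes gz, which writes the remaining
    # compressed data and the gzip trailer into buff.
    yield buff.getvalue()

demo_lines = ["this is my line: {}\n".format(x).encode("ascii")
              for x in range(10000)]
payload = b"".join(generate_gzip_chunks(demo_lines))
roundtrip = gzip.decompress(payload)
```

Closing the GzipFile does not close the BytesIO that was passed in as fileobj, so the final getvalue() call is safe.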

Generate and stream a compressed file with Flask
