Question

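I have a large local file. I want to upload a gzipped version of that file to S3 using the boto library. The file is too large to gzip efficiently on disk before uploading, so it should be gzipped in a streaming fashion during the upload.

The boto library provides a function set_contents_from_file(), which expects a file-like object that it will read from.

The gzip library provides the class GzipFile, which can be given an object via the parameter named fileobj; it writes to this object while compressing.

I would like to combine these two, but one API wants to do the reading itself and the other wants to do the writing itself; neither supports a passive operation (being written to or being read from).

Does anybody have an idea how to combine these in a working fashion?

EDIT: I accepted one answer (see below) because it hinted at where to go, but if you have the same problem, you might find my own answer (also below) more helpful, because in it I implemented a solution using multipart uploads.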

Answer 1

I implemented the solution hinted at in the comments of the accepted answer by garnaat:

import cStringIO
import gzip

def sendFileGz(bucket, key, fileName, suffix='.gz'):
    key += suffix
    mpu = bucket.initiate_multipart_upload(key)
    stream = cStringIO.StringIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='w')

    def uploadPart(partCount=[0]):
        partCount[0] += 1
        stream.seek(0)
        mpu.upload_part_from_file(stream, partCount[0])
        stream.seek(0)
        stream.truncate()

    with open(fileName, 'rb') as inputFile:  # binary mode, so the bytes are read unmodified
        while True:  # until EOF
            chunk = inputFile.read(8192)
            if not chunk:  # EOF?
                compressor.close()
                uploadPart()
                mpu.complete_upload()
                break
            compressor.write(chunk)
            if stream.tell() > 10<<20:  # min size for multipart upload is 5242880
                uploadPart()

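It seems to work without problems. After all, streaming is in most cases just chunking of the data. Here the chunks are about 10 MB each, but who cares? As long as we are not talking about chunks of several GB, I am fine with this.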
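The chunking idea described above can be looked at separately from the upload. What follows is a minimal sketch using only the standard library; the helper name `iter_gzip_parts` and its `part_size` and `read_size` parameters are hypothetical and not part of the answer. It yields compressed chunks that, concatenated, form a single valid gzip stream, which is the property the multipart upload relies on.

```python
import gzip
import io


def iter_gzip_parts(source, part_size=10 << 20, read_size=8192):
    """Yield gzip-compressed chunks of `source` (hypothetical helper).

    `source` is any binary file-like object. Each yielded chunk is at
    least `part_size` bytes, except possibly the last one. Joined
    together, the chunks are one valid gzip stream.
    """
    buffer = io.BytesIO()
    compressor = gzip.GzipFile(fileobj=buffer, mode='wb')
    while True:
        chunk = source.read(read_size)
        if not chunk:
            break
        compressor.write(chunk)
        # Compressed output arrives in bursts, so chunks can overshoot part_size.
        if buffer.tell() >= part_size:
            yield buffer.getvalue()
            buffer.seek(0)
            buffer.truncate()
    # Closing flushes the remaining data and writes the gzip trailer.
    compressor.close()
    yield buffer.getvalue()
```

Each yielded chunk could then be handed to whatever uploads a single part; only one chunk is held in memory at a time.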

Update for Python 3:

from io import BytesIO
import gzip

def sendFileGz(bucket, key, fileName, suffix='.gz'):
    key += suffix
    mpu = bucket.initiate_multipart_upload(key)
    stream = BytesIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='w')

    def uploadPart(partCount=[0]):
        partCount[0] += 1
        stream.seek(0)
        mpu.upload_part_from_file(stream, partCount[0])
        stream.seek(0)
        stream.truncate()

    with open(fileName, "rb") as inputFile:
        while True:  # until EOF
            chunk = inputFile.read(8192)
            if not chunk:  # EOF?
                compressor.close()
                uploadPart()
                mpu.complete_upload()
                break
            compressor.write(chunk)
            if stream.tell() > 10<<20:  # min size for multipart upload is 5242880
                uploadPart()
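The code above uses the legacy boto API. For readers on boto3 (the library used in Answer 3), the same approach might look roughly like the sketch below. It assumes the boto3 S3 client methods `create_multipart_upload`, `upload_part`, `complete_multipart_upload` and `abort_multipart_upload`; the function name `upload_gzip_multipart` and its parameters are hypothetical. The client is passed in as an argument rather than created inside the function, which is a design choice made here so the logic can be exercised without network access.

```python
import gzip
import io


def upload_gzip_multipart(s3_client, bucket_name, key, file_name,
                          part_size=10 << 20, read_size=8192):
    """Gzip `file_name` on the fly and send it as an S3 multipart upload.

    `s3_client` is assumed to be a boto3 S3 client, for example the
    result of boto3.client('s3').
    """
    upload = s3_client.create_multipart_upload(Bucket=bucket_name, Key=key)
    upload_id = upload['UploadId']
    parts = []
    buffer = io.BytesIO()

    def upload_part():
        # Send the buffered compressed bytes as the next part, then reset.
        part_number = len(parts) + 1
        response = s3_client.upload_part(
            Bucket=bucket_name, Key=key, UploadId=upload_id,
            PartNumber=part_number, Body=buffer.getvalue())
        parts.append({'ETag': response['ETag'], 'PartNumber': part_number})
        buffer.seek(0)
        buffer.truncate()

    try:
        with open(file_name, 'rb') as input_file, \
                gzip.GzipFile(fileobj=buffer, mode='wb') as compressor:
            for chunk in iter(lambda: input_file.read(read_size), b''):
                compressor.write(chunk)
                # Every part except the last must be at least 5 MiB.
                if buffer.tell() > part_size:
                    upload_part()
        # Leaving the `with` block closed the compressor and wrote the trailer.
        upload_part()
        s3_client.complete_multipart_upload(
            Bucket=bucket_name, Key=key, UploadId=upload_id,
            MultipartUpload={'Parts': parts})
    except Exception:
        # Abort so that already uploaded parts are not left behind in S3.
        s3_client.abort_multipart_upload(
            Bucket=bucket_name, Key=key, UploadId=upload_id)
        raise
```

A call would look like `upload_gzip_multipart(boto3.client('s3'), 'my-bucket-name', 'filename.gz', 'big.log')`, where the bucket, key and file names are placeholders.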

Answer 2

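There really isn't a way to do this, because S3 doesn't support true streaming input (i.e. chunked transfer encoding). You must know the Content-Length prior to upload, and the only way to know that is to have performed the gzip operation first.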

Answer 3

You can also easily compress bytes with gzip and upload them like this:

import gzip
import boto3

cred = boto3.Session().get_credentials()
s3client = boto3.client('s3',
                        aws_access_key_id=cred.access_key,
                        aws_secret_access_key=cred.secret_key,
                        aws_session_token=cred.token)

bucketname = 'my-bucket-name'
key = 'filename.gz'

s_in = b"Lots of content here"
gzip_object = gzip.compress(s_in)  # compresses the whole payload in memory

s3client.put_object(Bucket=bucketname, Body=gzip_object, Key=key)

s_in can be replaced with any bytes, such as the contents of an io.BytesIO, a pickle dump, or a file.

If you want to upload compressed JSON, here is a nice example: Upload compressed Json to S3
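The linked example is not reproduced here. As a rough illustration of the same idea, and not the code from that post, the compression step for JSON could look like this; the helper name `gzip_json` and the sample data are hypothetical:

```python
import gzip
import json


def gzip_json(obj):
    """Serialize `obj` to JSON and return the gzip-compressed UTF-8 bytes."""
    return gzip.compress(json.dumps(obj).encode('utf-8'))


payload = gzip_json({'name': 'example', 'values': [1, 2, 3]})
# The resulting bytes can be passed as Body to put_object, for example:
# s3client.put_object(Bucket=bucketname, Key='data.json.gz', Body=payload,
#                     ContentType='application/json', ContentEncoding='gzip')
```

As with the bytes example above, this compresses the whole payload in memory, so it suits small to medium objects rather than the very large file from the question.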

How to gzip while uploading to S3 using boto
