Question

So far, the files are only being downloaded individually, as shown below, rather than all into a single zip file:

s3client = boto3.client('s3')

t.download_file('firstbucket', obj['Key'], filename)

Answer 1

Using the AWS CLI saves me some trouble:

aws s3 cp s3://mybucket/mydir/ . --recursive ; zip myzip.zip *.csv

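You can change the wildcard to suit your needs. This also seems to be faster than doing it in Python, since the AWS CLI's transfers are optimized well beyond a simple boto download loop.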

Answer 2

If you want to use boto, you have to do it in a loop, as you did, and add each item to the zip file.

With the CLI, you can use s3 sync and then zip the result: https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

aws s3 sync s3://bucket-name ./local-location && zip -r bucket.zip ./local-location
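
If you would rather do the zipping step from Python once the sync has finished, the standard library's shutil.make_archive can do it. This is only a minimal sketch: the function name zip_directory and its arguments are made up for illustration, and it assumes the synced files already sit in a local directory.

```python
import shutil


def zip_directory(src_dir, archive_base):
    """Zip everything under src_dir into archive_base + '.zip'.

    Returns the path of the archive that was written. Paths inside the
    archive are relative to src_dir.
    """
    return shutil.make_archive(archive_base, 'zip', root_dir=src_dir)
```

For example, zip_directory('./local-location', 'bucket') writes bucket.zip. Unlike the zip -r command, the entries are stored relative to the directory rather than prefixed with local-location/.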

Answer 3

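You seem to be very close, but you need to pass a file name to ZipFile.write(), and download_file doesn't return a file name. The following should work, but I haven't tested it exhaustively.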

from tempfile import NamedTemporaryFile
from zipfile import ZipFile

import boto3


def archive_bucket(bucket_name, zip_name):
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')

    with ZipFile(zip_name, 'w') as zf:
        for page in paginator.paginate(Bucket=bucket_name):
            # 'Contents' is missing when a page holds no objects.
            for obj in page.get('Contents', []):
                # This might have issues on some systems since the file will
                # be open for writes in two places. You can use other
                # methods of creating a temporary file to work around that.
                with NamedTemporaryFile() as f:
                    s3.download_file(bucket_name, obj['Key'], f.name)
                    # Copies over the temporary file using the key as the
                    # file name in the zip.
                    zf.write(f.name, obj['Key'])

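Compared with the CLI solutions, this takes up less space, but it still isn't ideal. At some point you will still have two copies of a given file: one in the temporary file and one already in the zip. So you need enough disk space for the total size of all the files you download plus the size of the largest one. If there were a way to open a file-like object that writes directly to a file inside the zip archive, you could get around this. However, I don't know how to do that.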
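
For what it's worth, here is one way that idea might look. Since Python 3.6, ZipFile.open(name, 'w') returns a writable file-like object for a single archive member, and the 'Body' returned by get_object is a readable stream, so the two can be connected with shutil.copyfileobj without a temporary file. Treat this as an untested-against-S3 sketch: the function name archive_bucket_streaming is made up, the s3 argument is assumed to be a boto3 S3 client, and skipping keys that end in '/' is my own assumption about how "folder" placeholder objects should be handled.

```python
import shutil
from zipfile import ZIP_DEFLATED, ZipFile


def archive_bucket_streaming(s3, bucket_name, zip_name):
    """Stream every object in a bucket into one zip archive.

    `s3` is assumed to be a boto3 S3 client, e.g. boto3.client('s3').
    """
    paginator = s3.get_paginator('list_objects_v2')

    with ZipFile(zip_name, 'w', compression=ZIP_DEFLATED) as zf:
        for page in paginator.paginate(Bucket=bucket_name):
            # 'Contents' is missing when a page holds no objects.
            for obj in page.get('Contents', []):
                key = obj['Key']
                if key.endswith('/'):
                    # Assumption: skip zero-byte "folder" placeholder keys.
                    continue
                body = s3.get_object(Bucket=bucket_name, Key=key)['Body']
                # force_zip64 allows members larger than 2 GiB, since the
                # size is not known before writing starts.
                with zf.open(key, 'w', force_zip64=True) as dest:
                    # Copies the object in chunks straight into the archive.
                    shutil.copyfileobj(body, dest)
```

You would call it as archive_bucket_streaming(boto3.client('s3'), 'firstbucket', 'bucket.zip'). Only one chunk of each object is held in memory at a time, so the extra disk space for the temporary copy is no longer needed. The trade-off is that objects are fetched one at a time, so it will likely be slower than download_file for large objects.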
