Question

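I have a large number of files (> 1,000) stored in an S3 bucket, and I would like to iterate over them (e.g. in a for loop) to extract data from them using {{1}}.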

However, I noticed that, according to http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects, the list_objects() method of boto3's Client class lists at most 1,000 objects.

However, I would like to list all of the objects, even if there are more than 1,000. How can I do that?

Answer 1

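As kurt-peek pointed out, boto3 has a Paginator class that lets you iterate over pages of S3 objects, and it can easily be used to provide an iterator over the items within those pages: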

import boto3


def iterate_bucket_items(bucket):
    """
    Generator that iterates over all objects in a given s3 bucket

    See http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2 
    for return data format
    :param bucket: name of s3 bucket
    :return: dict of metadata for an object
    """


    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket)

    for page in page_iterator:
        if page['KeyCount'] > 0:
            for item in page['Contents']:
                yield item


for i in iterate_bucket_items(bucket='my_bucket'):
    print(i)

This will output something like the following:

{u'ETag': '"a8a9ee11bd4766273ab4b54a0e97c589"',
 u'Key': '2017-06-01-10-17-57-EBDC490AD194E7BF',
 u'LastModified': datetime.datetime(2017, 6, 1, 10, 17, 58, tzinfo=tzutc()),
 u'Size': 242,
 u'StorageClass': 'STANDARD'}
{u'ETag': '"03be0b66e34cbc4c037729691cd5efab"',
 u'Key': '2017-06-01-10-28-58-732EB022229AACF7',
 u'LastModified': datetime.datetime(2017, 6, 1, 10, 28, 59, tzinfo=tzutc()),
 u'Size': 238,
 u'StorageClass': 'STANDARD'}
...

Note that list_objects_v2 is recommended over list_objects: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html

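You can also do this at a lower level by calling list_objects_v2() directly and passing the NextContinuationToken value from each response as ContinuationToken in the next call, for as long as IsTruncated is true in the response.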
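The lower-level approach described above could be sketched roughly as follows. This is a minimal sketch rather than a definitive implementation: the function name `iterate_bucket_items_manually` is hypothetical, and the S3 client is passed in as a parameter (an assumption made here so the loop does not depend on how the client is created).

```python
def iterate_bucket_items_manually(client, bucket):
    """Yield every object in a bucket by following continuation tokens.

    `client` is assumed to be an S3 client, e.g. boto3.client('s3').
    """
    kwargs = {'Bucket': bucket}
    while True:
        response = client.list_objects_v2(**kwargs)
        # 'Contents' is absent when a page holds no objects.
        for item in response.get('Contents', []):
            yield item
        # Stop once the listing is no longer truncated.
        if not response.get('IsTruncated'):
            break
        # Otherwise ask for the next page, starting where this one ended.
        kwargs['ContinuationToken'] = response['NextContinuationToken']
```

It would be called in the same way as the generator above, for example `iterate_bucket_items_manually(boto3.client('s3'), 'my_bucket')`.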

Answer 2

I found that boto3 has a Paginator class for dealing with truncated results. The following worked for me:

import boto3
client = boto3.client('s3')  # S3 client used to build the paginator
paginator = client.get_paginator('list_objects')
page_iterator = paginator.paginate(Bucket='iper-apks')

After that, I can use the page_iterator generator in a for loop.
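Looping over `page_iterator` might look like the sketch below. The helper name `iterate_page_items` is hypothetical and not part of boto3. It reads `Contents` with a default because a page from an empty listing may carry no `Contents` key.

```python
def iterate_page_items(page_iterator):
    """Yield each object's metadata dict from every page of a listing."""
    for page in page_iterator:
        # Each page is one response dict; its objects sit under 'Contents'.
        for item in page.get('Contents', []):
            yield item
```

With the `page_iterator` created above, `for item in iterate_page_items(page_iterator): print(item['Key'])` would print the key of every object in the bucket.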

Answer 3

import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ListObjectsRequest
import java.util._

import scala.collection.JavaConverters._

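// Build an S3 client for the us-east-1 region.
val s3client = AmazonS3ClientBuilder.standard().withRegion(Regions.US_EAST_1).build()
// List keys under a prefix, grouping them by the "/" delimiter.
val listObjectsRequest = new ListObjectsRequest().withBucketName("<enter_bucket_name>").withPrefix("<enter_path>").withDelimiter("/")
// getCommonPrefixes returns the grouped "subfolder" prefixes from this single response; no further pages are fetched.
val bucketListing = s3client.listObjects(listObjectsRequest).getCommonPrefixes.asScala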

println("")

for (file <- bucketListing) {
    println(file)
}

println("")

How do I iterate over the files in an S3 bucket?