Retrieving subfolders names in S3 bucket from boto3

Asked: 2016-03-04 18:04:20

Tags: python amazon-web-services amazon-s3 boto3

Using boto3, I can access my AWS S3 bucket:

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket-name')

Now, the bucket contains the folder first-level, which itself contains several sub-folders named with a timestamp, for instance 1456753904534. I need to know the names of these sub-folders for another job I'm doing, and I wonder whether boto3 could retrieve them for me.

So I tried:

objs = bucket.meta.client.list_objects(Bucket='my-bucket-name')

which gives a dictionary whose key 'Contents' gives me all the third-level files instead of the second-level timestamp directories. In fact I get a list containing things like

{u'ETag': '"etag"', u'Key': u'first-level/1456753904534/part-00014', u'LastModified': datetime.datetime(2016, 2, 29, 13, 52, 24, tzinfo=tzutc()),
u'Owner': {u'DisplayName': 'owner', u'ID': 'id'},
u'Size': size, u'StorageClass': 'storageclass'}

You can see that the specific files, in this case part-00014, are retrieved, while I'd like to get the name of the directory alone. In principle I could strip the directory name out of all the paths, but it's ugly and expensive to retrieve everything at the third level just to get the second level!

I also tried something reported here:

for o in bucket.objects.filter(Delimiter='/'):
    print(o.key)

but I do not get the folders at the desired level.

Is there a way to solve this?

19 Answers:

Answer 0 (score: 52)

The code below returns only the 'subfolders' in a 'folder' of an S3 bucket.

import boto3
bucket = 'my-bucket'
#Make sure you provide / in the end
prefix = 'prefix-name-with-slash/'  

client = boto3.client('s3')
result = client.list_objects(Bucket=bucket, Prefix=prefix, Delimiter='/')
for o in result.get('CommonPrefixes'):
    print('sub folder : ', o.get('Prefix'))

For more details, see https://github.com/boto/boto3/issues/134

Answer 1 (score: 25)

It took me a lot of time to figure out, but finally here is a simple way to list the contents of a subfolder in an S3 bucket using boto3. Hope it helps.

prefix = "folderone/foldertwo/"
s3 = boto3.resource('s3')
bucket = s3.Bucket(name="bucket_name_here")
FilesNotFound = True
for obj in bucket.objects.filter(Prefix=prefix):
     print('{0}:{1}'.format(bucket.name, obj.key))
     FilesNotFound = False
if FilesNotFound:
     print("ALERT", "No file in {0}/{1}".format(bucket, prefix))

Answer 2 (score: 14)

S3 is an object storage; it doesn't have a real directory structure. The "/" is rather cosmetic. One reason people want a directory structure is so they can maintain/prune/add a tree to the application. For S3, you treat such a structure as a sort of index or search tag.

To manipulate objects in S3, you need boto3.client or boto3.resource, e.g. to list all objects:

import boto3 
s3 = boto3.client("s3")
all_objects = s3.list_objects(Bucket = 'bucket-name') 

http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.list_objects

In fact, if the S3 object name is stored using the '/' separator, you can use Python's os.path functions to extract the folder prefix:

import os
s3_key = 'first-level/1456753904534/part-00014'
filename = os.path.basename(s3_key) 
foldername = os.path.dirname(s3_key)

# if you are using a non-conventional delimiter, like '#'
s3_key = 'first-level#1456753904534#part-00014'
filename = s3_key.split("#")[-1]

A reminder about boto3: boto3.resource is a nice high-level API. There are pros and cons to using boto3.client vs boto3.resource. If you develop an internal shared library, using boto3.resource gives you a blackbox layer over the resources used.
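
As a rough illustration of the difference (the bucket and prefix names are taken from the question, purely for example), listing keys under a prefix with each interface might look like:

import boto3

# client: a thin wrapper around the S3 REST API, returning plain dicts
client = boto3.client('s3')
resp = client.list_objects_v2(Bucket='my-bucket-name', Prefix='first-level/')
client_keys = [obj['Key'] for obj in resp.get('Contents', [])]

# resource: a higher-level, object-oriented layer over the same calls
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket-name')
resource_keys = [obj.key for obj in bucket.objects.filter(Prefix='first-level/')]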

Answer 3 (score: 13)

Short answer

  • Use Delimiter='/'. This avoids doing a recursive listing of your bucket. Some answers here wrongly suggest doing a full listing and using some string manipulation to retrieve the directory names. That could be terribly inefficient. Remember that S3 has virtually no limit on the number of objects a bucket can contain. So, imagine that between bar/ and foo/ you have a trillion objects: you would wait a very long time to get ['bar/', 'foo/'].

  • Use Paginators. For the same reason (S3 is an engineer's approximation of infinity), you must list page by page and avoid storing the whole listing in memory. Instead, treat your "lister" as an iterator and handle the stream it produces.

  • Use boto3.client, not boto3.resource. The resource version doesn't seem to handle the Delimiter option well. If you have a resource, say bucket = boto3.resource('s3').Bucket(name), you can get the corresponding client with: bucket.meta.client. (A minimal sketch combining these three points follows right after this list.)
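
Putting those three points together, a minimal sketch (bucket and prefix names taken from the question) could look like this; the full iterator below generalizes it:

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects')

# Delimiter='/' makes S3 report the "sub-folders" as CommonPrefixes,
# and the paginator streams the listing page by page instead of
# holding it all in memory.
for page in paginator.paginate(Bucket='my-bucket-name',
                               Prefix='first-level/',
                               Delimiter='/'):
    for cp in page.get('CommonPrefixes', []):
        print(cp['Prefix'])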

Long answer

Here is the iterator I use for simple buckets (no version handling).

import os

import boto3
from collections import namedtuple
from operator import attrgetter


S3Obj = namedtuple('S3Obj', ['key', 'mtime', 'size', 'ETag'])


def s3list(bucket, path, start=None, end=None, recursive=True, list_dirs=True,
           list_objs=True, limit=None):
    """
    Iterator that lists a bucket's objects under path, (optionally) starting with
    start and ending before end.

    If recursive is False, then list only the "depth=0" items (dirs and objects).

    If recursive is True, then list recursively all objects (no dirs).

    Args:
        bucket:
            a boto3.resource('s3').Bucket().
        path:
            a directory in the bucket.
        start:
            optional: start key, inclusive (may be a relative path under path, or
            absolute in the bucket)
        end:
            optional: stop key, exclusive (may be a relative path under path, or
            absolute in the bucket)
        recursive:
            optional, default True. If True, lists only objects. If False, lists
            only depth 0 "directories" and objects.
        list_dirs:
            optional, default True. Has no effect in recursive listing. On
            non-recursive listing, if False, then directories are omitted.
        list_objs:
            optional, default True. If False, then directories are omitted.
        limit:
            optional. If specified, then lists at most this many items.

    Returns:
        an iterator of S3Obj.

    Examples:
        # set up
        >>> s3 = boto3.resource('s3')
        ... bucket = s3.Bucket(name)

        # iterate through all S3 objects under some dir
        >>> for p in s3list(bucket, 'some/dir'):
        ...     print(p)

        # iterate through up to 20 S3 objects under some dir, starting with foo_0010
        >>> for p in s3list(bucket, 'some/dir', limit=20, start='foo_0010'):
        ...     print(p)

        # non-recursive listing under some dir:
        >>> for p in s3list(bucket, 'some/dir', recursive=False):
        ...     print(p)

        # non-recursive listing under some dir, listing only dirs:
        >>> for p in s3list(bucket, 'some/dir', recursive=False, list_objs=False):
        ...     print(p)
    """
    kwargs = dict()
    if start is not None:
        if not start.startswith(path):
            start = os.path.join(path, start)
        # note: need to use a string just smaller than start, because
        # the list_object API specifies that start is excluded (the first
        # result is *after* start).
        kwargs.update(Marker=__prev_str(start))
    if end is not None:
        if not end.startswith(path):
            end = os.path.join(path, end)
    if not recursive:
        kwargs.update(Delimiter='/')
        if not path.endswith('/'):
            path += '/'
    kwargs.update(Prefix=path)
    if limit is not None:
        kwargs.update(PaginationConfig={'MaxItems': limit})

    paginator = bucket.meta.client.get_paginator('list_objects')
    for resp in paginator.paginate(Bucket=bucket.name, **kwargs):
        q = []
        if 'CommonPrefixes' in resp and list_dirs:
            q = [S3Obj(f['Prefix'], None, None, None) for f in resp['CommonPrefixes']]
        if 'Contents' in resp and list_objs:
            q += [S3Obj(f['Key'], f['LastModified'], f['Size'], f['ETag']) for f in resp['Contents']]
        # note: even with sorted lists, it is faster to sort(a+b)
        # than heapq.merge(a, b) at least up to 10K elements in each list
        q = sorted(q, key=attrgetter('key'))
        if limit is not None:
            q = q[:limit]
            limit -= len(q)
        for p in q:
            if end is not None and p.key >= end:
                return
            yield p


def __prev_str(s):
    if len(s) == 0:
        return s
    s, c = s[:-1], ord(s[-1])
    if c > 0:
        s += chr(c - 1)
    s += ''.join(['\u7FFF' for _ in range(10)])
    return s

Testing

The following is helpful to test the behavior of the paginator and list_objects. It creates a number of dirs and files. Since pages hold up to 1000 entries, we use a multiple of that for dirs and files. dirs contains only directories (each having one object). mixed contains a mix of dirs and objects, with a ratio of 2 objects per dir (plus one object under each dir, of course; S3 stores only objects).

import os
import concurrent.futures

def genkeys(top='tmp/test', n=2000):
    for k in range(n):
        if k % 100 == 0:
            print(k)
        for name in [
            os.path.join(top, 'dirs', f'{k:04d}_dir', 'foo'),
            os.path.join(top, 'mixed', f'{k:04d}_dir', 'foo'),
            os.path.join(top, 'mixed', f'{k:04d}_foo_a'),
            os.path.join(top, 'mixed', f'{k:04d}_foo_b'),
        ]:
            yield name


with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:
    executor.map(lambda name: bucket.put_object(Key=name, Body='hi\n'.encode()), genkeys())

The resulting structure is:

./dirs/0000_dir/foo
./dirs/0001_dir/foo
./dirs/0002_dir/foo
...
./dirs/1999_dir/foo
./mixed/0000_dir/foo
./mixed/0000_foo_a
./mixed/0000_foo_b
./mixed/0001_dir/foo
./mixed/0001_foo_a
./mixed/0001_foo_b
./mixed/0002_dir/foo
./mixed/0002_foo_a
./mixed/0002_foo_b
...
./mixed/1999_dir/foo
./mixed/1999_foo_a
./mixed/1999_foo_b

With a few modifications of the s3list given above to inspect the responses from the paginator, you can observe some fun facts:

  • Marker is really exclusive. Given Marker=topdir + 'mixed/0500_foo_a', the listing starts after that key (as per the Amazon S3 API), i.e. with .../mixed/0500_foo_b. That's the reason for __prev_str().

  • Using the Delimiter when listing mixed/, each response from the paginator contains 666 keys and 334 common prefixes. It's pretty good at not building enormous responses.

  • By contrast, when listing dirs/, each response from the paginator contains 1000 common prefixes (and no keys).

  • PaginationConfig={'MaxItems': limit}的形式传递限制仅限制键的数量,而不限制公共前缀。我们通过进一步截断迭代器流来解决这个问题。

Answer 4 (score: 11)

I had the same issue but managed to resolve it using boto3.client and list_objects_v2 with the Bucket and StartAfter parameters.

s3client = boto3.client('s3')
bucket = 'my-bucket-name'
startAfter = 'firstlevelFolder/secondLevelFolder'

theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter )
for obj in theobjects['Contents']:
    print(obj['Key'])

The output of the code above will display the following:

firstlevelFolder/secondLevelFolder/item1
firstlevelFolder/secondLevelFolder/item2

Boto3 list_objects_v2 Documentation

In order to strip out only the directory name for secondLevelFolder, I just used the Python method split():

s3client = boto3.client('s3')
bucket = 'my-bucket-name'
startAfter = 'firstlevelFolder/secondLevelFolder'

theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter )
for obj in theobjects['Contents']:
    directoryName = obj['Key'].split('/')
    print(directoryName[1])

The output of the code above will display the following:

secondLevelFolder
secondLevelFolder

Python split() Documentation

If you'd like to get the directory name AND the contents' item name, replace the print line with the following:

print "{}/{}".format(fileName[1], fileName[2])

And the following will be output:

secondLevelFolder/item1
secondLevelFolder/item2

Hope this helps

Answer 5 (score: 8)

Answer 6 (score: 7)

The big realization with S3 is that there are no folders/directories, just keys. The apparent folder structure is simply prepended to the filename to become the 'Key', so to list the contents of myBucket's some/path/to/the/file/ you can try:

s3 = boto3.client('s3')
for obj in s3.list_objects_v2(Bucket="myBucket", Prefix="some/path/to/the/file/")['Contents']:
    print(obj['Key'])

which would give you something like:

some/path/to/the/file/yoMumma.jpg
some/path/to/the/file/meAndYoMuma.gif
...

Answer 7 (score: 5)

The following works for me... S3 objects:

s3://bucket/
    form1/
       section11/
          file111
          file112
       section12/
          file121
    form2/
       section21/
          file211
          file112
       section22/
          file221
          file222
          ...
      ...
   ...

Using:

from boto3.session import Session
session = Session()
s3client = session.client('s3')
resp = s3client.list_objects(Bucket=bucket, Prefix='', Delimiter="/")
forms = [x['Prefix'] for x in resp['CommonPrefixes']] 

we get:

form1/
form2/
...

And to get the section prefixes one level down, under form1/:

resp = s3client.list_objects(Bucket=bucket, Prefix='form1/', Delimiter="/")
sections = [x['Prefix'] for x in resp['CommonPrefixes']] 

Answer 8 (score: 4)

The AWS CLI does this (presumably without fetching and iterating through all keys in the bucket) when you run aws s3 ls s3://my-bucket/, so I figured there must be a way to do it with boto3.

https://github.com/aws/aws-cli/blob/0fedc4c1b6a7aee13e2ed10c3ada778c702c22c3/awscli/customizations/s3/subcommands.py#L499

It looks like they do indeed use Prefix and Delimiter - I was able to write a function that gets all the directories at the root level of a bucket by modifying that code a bit:

import boto3

def list_folders_in_bucket(bucket):
    paginator = boto3.client('s3').get_paginator('list_objects')
    folders = []
    iterator = paginator.paginate(Bucket=bucket, Prefix='', Delimiter='/', PaginationConfig={'PageSize': None})
    for response_data in iterator:
        prefixes = response_data.get('CommonPrefixes', [])
        for prefix in prefixes:
            prefix_name = prefix['Prefix']
            if prefix_name.endswith('/'):
                folders.append(prefix_name.rstrip('/'))
    return folders

Answer 9 (score: 2)

Here is a possible solution:

def download_list_s3_folder(my_bucket,my_folder):
    import boto3
    s3 = boto3.client('s3')
    response = s3.list_objects_v2(
        Bucket=my_bucket,
        Prefix=my_folder,
        MaxKeys=1000)
    return [item["Key"] for item in response['Contents']]

Answer 10 (score: 1)

First of all, there is no real folder concept in S3. You definitely can have a file @ '/folder/subfolder/myfile.txt' with no folder nor subfolder existing.

To "simulate" a folder in S3, you must create an empty file whose name ends with a "/" (see Amazon S3 boto - how to create a folder?).
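
For instance, a minimal boto3 sketch of creating such a placeholder "folder" object (the bucket and key names here are just examples):

import boto3

s3 = boto3.client('s3')
# An empty object whose key ends with '/' is what the S3 console displays as a folder.
s3.put_object(Bucket='my-bucket-name', Key='first-level/new-subfolder/', Body=b'')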

For your issue, you should use the method get_all_keys with the two parameters prefix and delimiter (note this is from the older boto library, not boto3):

https://github.com/boto/boto/blob/develop/boto/s3/bucket.py#L427

for key in bucket.get_all_keys(prefix='first-level/', delimiter='/'):
    print(key.name)

Answer 11 (score: 1)

This works well for me, as it retrieves only the first-level folders under the bucket:

client = boto3.client('s3')
bucket = 'my-bucket-name'
folders = set()

for prefix in client.list_objects(Bucket=bucket, Delimiter='/')['CommonPrefixes']:
    folders.add(prefix['Prefix'][:-1])
    
print(folders)

You could do the same with a list instead of a set, since the folder names are unique.

Answer 12 (score: 0)

Using boto3.resource

This builds upon the answer by itz-azhar to apply an optional limit. It is obviously substantially simpler to use than the boto3.client version.

import logging
from typing import List, Optional

import boto3
from boto3_type_annotations.s3 import ObjectSummary  # pip install boto3_type_annotations

log = logging.getLogger(__name__)

_S3_RESOURCE = boto3.resource("s3")

def s3_list(bucket_name: str, prefix: str, *, limit: Optional[int] = None) -> List[ObjectSummary]:
    """Return a list of S3 object summaries."""
    # Ref: https://stackoverflow.com/a/57718002/
    return list(_S3_RESOURCE.Bucket(bucket_name).objects.limit(count=limit).filter(Prefix=prefix))

if __name__ == "__main__":
    s3_list("noaa-gefs-pds", "gefs.20190828/12/pgrb2a", limit=10_000)

Using boto3.client

This uses list_objects_v2 and builds upon the answer by CpILL to allow retrieving more than 1000 objects.
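
The client-based snippet itself is not reproduced above. As a rough sketch of that approach (paginating list_objects_v2 and truncating to an optional limit; the function and variable names here are my own, not the original answer's):

import itertools
from typing import Iterator, Optional

import boto3

_S3_CLIENT = boto3.client("s3")

def s3_list_keys(bucket_name: str, prefix: str, *, limit: Optional[int] = None) -> Iterator[str]:
    """Yield object keys under prefix, handling pagination transparently."""
    paginator = _S3_CLIENT.get_paginator("list_objects_v2")
    keys = (
        obj["Key"]
        for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix)
        for obj in page.get("Contents", [])
    )
    # itertools.islice treats a stop of None as "no limit", so this covers both cases.
    return itertools.islice(keys, limit)

if __name__ == "__main__":
    for key in s3_list_keys("noaa-gefs-pds", "gefs.20190828/12/pgrb2a", limit=10):
        print(key)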

Answer 13 (score: 0)

I know boto3 is the topic being discussed here, but I find it is usually quicker and more intuitive to simply use awscli for something like this - awscli retains more capabilities than boto3, for what it is worth.

For example, if I have objects saved in "subfolders" associated with a given bucket, I can list them all out with something like this:

1) 'mydata' = bucket name

2) 'f1/f2/f3' = "path" leading to the "files" or objects

3) 'foo2.csv, barfar.segy, gar.tar' = all objects "inside" f3

So, we can think of the "absolute path" leading to these objects as: 'mydata/f1/f2/f3/foo2.csv'...

Using awscli commands, we can easily list all objects inside a given "subfolder" with:

aws s3 ls s3://mydata/f1/f2/f3/ --recursive

Answer 14 (score: 0)

If you're trying to get a large number of S3 bucket objects, here is code that can handle the pagination:

import boto3

def get_matching_s3_objects(bucket, prefix="", suffix=""):

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    kwargs = {'Bucket': bucket}

    # We can pass the prefix directly to the S3 API.  If the user has passed
    # a tuple or list of prefixes, we go through them one by one.
    if isinstance(prefix, str):
        prefixes = (prefix, )
    else:
        prefixes = prefix

    for key_prefix in prefixes:
        kwargs["Prefix"] = key_prefix

        for page in paginator.paginate(**kwargs):
            try:
                contents = page["Contents"]
            except KeyError:
                return

            for obj in contents:
                key = obj["Key"]
                if key.endswith(suffix):
                    yield obj

Answer 15 (score: 0)

As of Boto 1.13.3, it turns out to be as simple as this (if you skip all the pagination considerations, which other answers have covered):

import boto3

def get_sub_paths(bucket, prefix):
    s3 = boto3.client('s3')
    response = s3.list_objects_v2(
        Bucket=bucket,
        Prefix=prefix,
        Delimiter='/',  # needed so that 'CommonPrefixes' is populated
        MaxKeys=1000)
    return [item["Prefix"] for item in response['CommonPrefixes']]

Answer 16 (score: 0)

A recursive approach to list all the distinct paths in an S3 bucket:

import boto3

def common_prefix(bucket_name, paths, prefix=''):
    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects')
    result = paginator.paginate(Bucket=bucket_name, Prefix=prefix, Delimiter='/')
    for prefix in result.search('CommonPrefixes'):
        if prefix is None:
            break
        paths.append(prefix.get('Prefix'))
        common_prefix(bucket_name, paths, prefix.get('Prefix'))
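
For example (with a placeholder bucket name), it can be called like this to collect every "directory" path in the bucket:

paths = []
common_prefix('my-bucket-name', paths)            # walk from the bucket root
# or start below a given prefix:
# common_prefix('my-bucket-name', paths, 'first-level/')
print(paths)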

Answer 17 (score: 0)

The "directories" you want to list are not really objects but substrings of object keys, so they will not show up in the objects.filter method. You can use the client's list_objects with a Prefix specified instead:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket-name')
res = bucket.meta.client.list_objects(Bucket=bucket.name, Delimiter='/', Prefix = 'sub-folder/')
for o in res.get('CommonPrefixes'):
    print(o.get('Prefix'))

Answer 18 (score: 0)

There are some great answers to this question already.

I had been using the boto3 resource's objects.filter method to get all files.
The objects.filter method returns an iterator and is extremely fast.
Converting it to a list, however, is time consuming.

list_objects_v2 returns the actual contents rather than an iterator.
However, you need to loop to get all the contents because it has a size limit of 1000.

To get only the folders, I apply a list comprehension like this:

[x.split('/')[index] for x in files]
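
As a rough sketch of how that could be used (files and index are not defined in the answer, so the keys and the index value below are assumptions based on the question's layout):

# keys as they would appear in response['Contents'] (hard-coded here for illustration)
files = [
    'first-level/1456753904534/part-00014',
    'first-level/1456753904534/part-00015',
    'first-level/1456753999999/part-00001',
]
index = 1  # 0 -> 'first-level', 1 -> the timestamp "sub-folder"
folders = [x.split('/')[index] for x in files]
print(sorted(set(folders)))  # ['1456753904534', '1456753999999']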

Here is the time taken by the various methods.
There were 125,077 files when running these tests.

%%timeit

s3 = boto3.resource('s3')
response = s3.Bucket('bucket').objects.filter(Prefix='foo/bar/')
3.95 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit

s3 = boto3.resource('s3')
response = s3.Bucket('foo').objects.filter(Prefix='foo/bar/')
files = list(response)
26.6 s ± 1.08 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit

s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket='bucket', Prefix='foo/bar/')
files = response['Contents']
while 'NextContinuationToken' in response:
    response = s3.list_objects_v2(Bucket='bucket', Prefix='foo/bar/', ContinuationToken=response['NextContinuationToken'])
    files.extend(response['Contents'])
22.8 s ± 1.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)