How to get a unique list of folders in Amazon S3 using python boto

Date: 2013-06-28 23:34:03

Tags: python amazon-s3 boto

I am using boto with Python to work with Amazon S3.

If I use

[key.name for key in list(self.bucket.list())]

then I get all the keys for all the files:

mybucket/files/pdf/abc.pdf
mybucket/files/pdf/abc2.pdf
mybucket/files/pdf/abc3.pdf
mybucket/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/new/abc.pdf
mybucket/files/pdf/2011/

What is the best way to:
1. either get all folders from S3
2. or from that list just strip the file name off the end and get the unique folder keys?

I am thinking of doing this:

set([re.sub("/[^/]*$", "/", path) for path in mylist])
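As a quick local sanity check of that regex approach (pure standard library, run against the example key list above):

```python
import re

# Example key names as returned by bucket.list()
mylist = [
    "mybucket/files/pdf/abc.pdf",
    "mybucket/files/pdf/abc2.pdf",
    "mybucket/files/pdf/new/",
    "mybucket/files/pdf/new/abc.pdf",
    "mybucket/files/pdf/2011/",
]

# Strip the last path component (keeping the trailing slash),
# then deduplicate with a set
folders = set(re.sub(r"/[^/]*$", "/", path) for path in mylist)
print(sorted(folders))
# ['mybucket/files/pdf/', 'mybucket/files/pdf/2011/', 'mybucket/files/pdf/new/']
```

Note that keys which already end in `/` (the empty "folder marker" objects) are left unchanged by the substitution, which is why they appear as folders in the result.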

8 Answers:

Answer 0 (score: 40)

Building on sethwm's answer:

To get the top-level directories:

list(bucket.list("", "/"))

To get the subdirectories of files:

list(bucket.list("files/", "/"))

And so on.

Answer 1 (score: 16)

As pointed out in one of the comments, the approach suggested by j1m returns a Prefix object. If you are after the name/path, you can use its name attribute. For example:

import boto
import boto.s3

conn = boto.s3.connect_to_region('us-west-2')
bucket = conn.get_bucket(your_bucket)

folders = bucket.list("","/")
for folder in folders:
    print folder.name

Answer 2 (score: 13)

Since I don't know python or boto, this will be an incomplete answer, but I want to comment on the basic concept in the question.

One of the other posters is right: there is no concept of a directory in S3. There are only flat key/value pairs. Many applications pretend certain delimiters indicate directory entries, for example "/" or "\". Some applications go as far as putting up a dummy file so that if the "directory" empties out, you can still see it in list results.

You don't always have to pull down the entire bucket and do the filtering locally. S3 has the concept of a delimited list where you specify what you deem your path delimiter ("/", "\", "|", "foobar", etc.) and S3 will return virtual results to you, similar to what you want.

http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html (see the delimiter heading)

This API will give you one level of directories. So if in your example you had:

mybucket/files/pdf/abc.pdf
mybucket/files/pdf/abc2.pdf
mybucket/files/pdf/abc3.pdf
mybucket/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/new/abc.pdf
mybucket/files/pdf/2011/

and you passed in a LIST with prefix "" and delimiter "/", you'd get the result:

mybucket/files/

If you passed in a LIST with prefix "mybucket/files/" and delimiter "/", you'd get the result:

mybucket/files/pdf/

If you passed in a LIST with prefix "mybucket/files/pdf/" and delimiter "/", you'd get the result:

mybucket/files/pdf/abc.pdf
mybucket/files/pdf/abc2.pdf
mybucket/files/pdf/abc3.pdf
mybucket/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/2011/

If you wanted to eliminate the pdf files themselves from the result set, you'd be on your own from there.

Now, how you do this in python/boto I don't know, but hopefully there's a way to pass those parameters through.
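The delimited LIST behavior described above can be simulated locally without touching AWS at all. This sketch (a hypothetical helper, pure Python) rolls keys up into common prefixes the way S3 does for a given prefix and delimiter:

```python
def delimited_list(keys, prefix="", delimiter="/"):
    """Mimic S3's delimited LIST: keys under `prefix` that contain
    `delimiter` after the prefix are rolled up into common prefixes;
    keys without it are returned as plain keys."""
    prefixes, plain_keys = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Roll everything up to (and including) the first delimiter
            prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            plain_keys.append(key)
    return sorted(prefixes), plain_keys

keys = [
    "mybucket/files/pdf/abc.pdf",
    "mybucket/files/pdf/new/",
    "mybucket/files/pdf/new/abc.pdf",
    "mybucket/files/pdf/2011/",
]
print(delimited_list(keys, prefix="mybucket/files/pdf/"))
# (['mybucket/files/pdf/2011/', 'mybucket/files/pdf/new/'], ['mybucket/files/pdf/abc.pdf'])
```

This matches the LIST examples above: one level of "subdirectories" comes back as common prefixes, and only the files directly under the prefix come back as keys.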

Answer 3 (score: 7)

Basically there is no such thing as a folder in S3. Internally everything is stored as a key, and if the key name contains a slash character, clients may decide to display it as a folder.

With that in mind, you should first get all the keys and then use a regex to filter out the paths that include a slash. The solution you have now is already a good start.

Answer 4 (score: 4)

I see you have successfully made the boto connection. If you only have one directory that you are interested in (like you provided in the example), I think what you can do is use prefix and delimiter that's already provided via AWS (Link).

Boto uses this feature in its bucket object, and you can retrieve a hierarchical directory information using prefix and delimiter. The bucket.list() will return a boto.s3.bucketlistresultset.BucketListResultSet object.

I tried this a couple of ways, and if you do choose to use a delimiter= argument in bucket.list(), the returned object is an iterator for boto.s3.prefix.Prefix, rather than boto.s3.key.Key. In other words, if you try to retrieve the subdirectories you should put delimiter='/', and as a result you will get an iterator for the Prefix objects.

Both returned objects (either prefix or key object) have a .name attribute, so if you want the directory/file information as a string, you can do so by printing like below:

from boto.s3.connection import S3Connection

key_id = '...'
secret_key = '...'

# Create connection
conn = S3Connection(key_id, secret_key)

# Get list of all buckets
allbuckets = conn.get_all_buckets()
for bucket_name in allbuckets:
    print(bucket_name)

# Connect to a specific bucket
bucket = conn.get_bucket('bucket_name')

# Get subdirectory info
for key in bucket.list(prefix='sub_directory/', delimiter='/'):
    print(key.name)

Answer 5 (score: 2)

The boto interface allows you to list the contents of a bucket and give a prefix for the entries. That way you can get the entries that would be inside a directory in a normal filesystem:

import boto
AWS_ACCESS_KEY_ID = '...'
AWS_SECRET_ACCESS_KEY = '...'

conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket('your-bucket-name')  # get_bucket() requires a bucket name
bucket_entries = bucket.list(prefix='/path/to/your/directory')

for entry in bucket_entries:
    print entry

Answer 6 (score: 2)

As others have said, the problem here is that a folder doesn't necessarily have a key, so you have to search through the strings for the / character and figure out your folders from that. Here's one way to generate a recursive dictionary imitating the folder structure.

If you want all the files in the folders along with their URLs:

assets = {}
for key in self.bucket.list(str(self.org) + '/'):
    path = key.name.split('/')

    # Walk down the nested dict, creating levels as needed
    identifier = assets
    for uri in path[1:-1]:
        try:
            identifier[uri]
        except KeyError:
            identifier[uri] = {}
        identifier = identifier[uri]

    if not key.name.endswith('/'):
        identifier[path[-1]] = key.generate_url(expires_in=0, query_auth=False)

return assets

If you only want the empty folders:

folders = {}
for key in self.bucket.list(str(self.org) + '/'):
    path = key.name.split('/')

    # Walk down the nested dict, creating levels as needed
    identifier = folders
    for uri in path[1:-1]:
        try:
            identifier[uri]
        except KeyError:
            identifier[uri] = {}
        identifier = identifier[uri]

    if key.name.endswith('/'):
        identifier[path[-1]] = {}

return folders

These can then be read out recursively later.
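The nested-dictionary idea can be exercised without S3 by running it over plain key-name strings (a hypothetical key list; real code would iterate bucket.list() and store URLs instead of the key names):

```python
def build_tree(key_names):
    """Build a nested dict mirroring the folder structure in key names."""
    tree = {}
    for name in key_names:
        parts = name.split("/")
        node = tree
        # Descend into each folder component, creating levels as needed
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        if parts[-1]:  # an empty last part means the key was a folder marker
            node[parts[-1]] = name  # real code would store key.generate_url(...) here
    return tree

keys = ["files/pdf/abc.pdf", "files/pdf/new/", "files/pdf/new/abc.pdf"]
print(build_tree(keys))
# {'files': {'pdf': {'abc.pdf': 'files/pdf/abc.pdf', 'new': {'abc.pdf': 'files/pdf/new/abc.pdf'}}}}
```

Folder-marker keys (ending in `/`) contribute only their path components, so empty folders still appear in the tree as empty dicts.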

Answer 7 (score: 1)

I found the following works using boto3:

import boto3

def list_folders(s3_client, bucket_name):
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix='', Delimiter='/')
    for content in response.get('CommonPrefixes', []):
        yield content.get('Prefix')

session = boto3.session.Session()
s3_client = session.client('s3')
bucket_name = 'your-bucket-name'  # your bucket name here
folder_list = list_folders(s3_client, bucket_name)
for folder in folder_list:
    print('Folder found: %s' % folder)
