How to get a unique list of folders in Amazon S3 using python boto

Date: 2013-06-28 23:34:03

Tags: python amazon-s3 boto

I am using boto with Python to work with Amazon S3.

If I use

[key.name for key in list(self.bucket.list())]

then I get all the keys for all the files:

mybucket/files/pdf/abc.pdf
mybucket/files/pdf/abc2.pdf
mybucket/files/pdf/abc3.pdf
mybucket/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/new/abc.pdf
mybucket/files/pdf/2011/

What is the best way to:
1. either get all folders from S3
2. or from that list just strip the file name off the end and get the unique folder keys?

I am thinking of doing this:

set([re.sub("/[^/]*$", "/", path) for path in mylist])
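As a quick local sanity check of that regex approach (pure standard library, run against the example key list above):

```python
import re

# Example key names as returned by bucket.list()
mylist = [
    "mybucket/files/pdf/abc.pdf",
    "mybucket/files/pdf/abc2.pdf",
    "mybucket/files/pdf/new/",
    "mybucket/files/pdf/new/abc.pdf",
    "mybucket/files/pdf/2011/",
]

# Strip the last path component (keeping the trailing slash),
# then deduplicate with a set
folders = set(re.sub(r"/[^/]*$", "/", path) for path in mylist)
print(sorted(folders))
# ['mybucket/files/pdf/', 'mybucket/files/pdf/2011/', 'mybucket/files/pdf/new/']
```

Note that keys which already end in `/` (the empty "folder marker" objects) are left unchanged by the substitution, which is why they appear as folders in the result.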

8 Answers:

Answer 0 (score: 40)

Building on sethwm's answer:

To get the top-level directories:

list(bucket.list("", "/"))

To get the subdirectories of files:

list(bucket.list("files/", "/"))

And so on.

Answer 1 (score: 16)

As pointed out in one of the comments, the approach suggested by j1m returns a Prefix object. If you are after the name/path, you can use its name attribute. For example:

import boto
import boto.s3

conn = boto.s3.connect_to_region('us-west-2')
bucket = conn.get_bucket(your_bucket)

folders = bucket.list("","/")
for folder in folders:
    print folder.name

Answer 2 (score: 13)

Since I don't know python or boto, this will be an incomplete answer, but I want to comment on the basic concept in the question.

One of the other posters is right: there is no concept of a directory in S3. There are only flat key/value pairs. Many applications pretend certain delimiters indicate directory entries, for example "/" or "\". Some applications go as far as putting up a dummy file so that if the "directory" empties out, you can still see it in list results.

You don't always have to pull down the entire bucket and do the filtering locally. S3 has the concept of a delimited list where you specify what you deem your path delimiter ("/", "\", "|", "foobar", etc.) and S3 will return virtual results to you, similar to what you want.

http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html (see the delimiter heading)

This API will give you one level of directories. So if in your example you had:

mybucket/files/pdf/abc.pdf
mybucket/files/pdf/abc2.pdf
mybucket/files/pdf/abc3.pdf
mybucket/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/new/abc.pdf
mybucket/files/pdf/2011/

and you passed in a LIST with prefix "" and delimiter "/", you'd get the result:

mybucket/files/

If you passed in a LIST with prefix "mybucket/files/" and delimiter "/", you'd get the result:

mybucket/files/pdf/

If you passed in a LIST with prefix "mybucket/files/pdf/" and delimiter "/", you'd get the result:

mybucket/files/pdf/abc.pdf
mybucket/files/pdf/abc2.pdf
mybucket/files/pdf/abc3.pdf
mybucket/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/2011/

If you wanted to eliminate the pdf files themselves from the result set, you'd be on your own from there.

Now, how you do this in python/boto I don't know, but hopefully there's a way to pass those parameters through.
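The delimited LIST behavior described above can be simulated locally without touching AWS at all. This sketch (a hypothetical helper, pure Python) rolls keys up into common prefixes the way S3 does for a given prefix and delimiter:

```python
def delimited_list(keys, prefix="", delimiter="/"):
    """Mimic S3's delimited LIST: keys under `prefix` that contain
    `delimiter` after the prefix are rolled up into common prefixes;
    keys without it are returned as plain keys."""
    prefixes, plain_keys = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Roll everything up to (and including) the first delimiter
            prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            plain_keys.append(key)
    return sorted(prefixes), plain_keys

keys = [
    "mybucket/files/pdf/abc.pdf",
    "mybucket/files/pdf/new/",
    "mybucket/files/pdf/new/abc.pdf",
    "mybucket/files/pdf/2011/",
]
print(delimited_list(keys, prefix="mybucket/files/pdf/"))
# (['mybucket/files/pdf/2011/', 'mybucket/files/pdf/new/'], ['mybucket/files/pdf/abc.pdf'])
```

This matches the LIST examples above: one level of "subdirectories" comes back as common prefixes, and only the files directly under the prefix come back as keys.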

Answer 3 (score: 7)

Basically there is no such thing as a folder in S3. Internally everything is stored as a key, and if the key name contains a slash character, clients may decide to display it as a folder.

With that in mind, you should first get all the keys and then use a regex to filter out the paths that include a slash. The solution you have now is already a good start.

Answer 4 (score: 4)

I see you have successfully made the boto connection. If you only have one directory that you are interested in (like you provided in the example), I think what you can do is use prefix and delimiter that's already provided via AWS (Link).

Boto uses this feature in its bucket object, and you can retrieve a hierarchical directory information using prefix and delimiter. The bucket.list() will return a boto.s3.bucketlistresultset.BucketListResultSet object.

I tried this a couple of ways, and if you do choose to use a delimiter= argument in bucket.list(), the returned object is an iterator for boto.s3.prefix.Prefix, rather than boto.s3.key.Key. In other words, if you try to retrieve the subdirectories you should put delimiter='/', and as a result you will get an iterator for the Prefix objects.

Both returned objects (either prefix or key object) have a .name attribute, so if you want the directory/file information as a string, you can do so by printing like below:

from boto.s3.connection import S3Connection

key_id = '...'
secret_key = '...'

# Create connection
conn = S3Connection(key_id, secret_key)

# Get list of all buckets
allbuckets = conn.get_all_buckets()
for bucket_name in allbuckets:
    print(bucket_name)

# Connect to a specific bucket
bucket = conn.get_bucket('bucket_name')

# Get subdirectory info
for key in bucket.list(prefix='sub_directory/', delimiter='/'):
    print(key.name)

Answer 5 (score: 2)

The boto interface allows you to list the contents of a bucket and give a prefix for the entries. That way you can get the entries that would be inside a directory in a normal filesystem:

import boto
AWS_ACCESS_KEY_ID = '...'
AWS_SECRET_ACCESS_KEY = '...'

conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket('your-bucket-name')  # get_bucket() requires a bucket name
bucket_entries = bucket.list(prefix='/path/to/your/directory')

for entry in bucket_entries:
    print entry

Answer 6 (score: 2)

As others have said, the problem here is that a folder doesn't necessarily have a key, so you have to search through the strings for the / character and figure out your folders from that. Here's one way to generate a recursive dictionary imitating the folder structure.

If you want all the files in the folders along with their URLs:

assets = {}
for key in self.bucket.list(str(self.org) + '/'):
    path = key.name.split('/')

    # Walk down the nested dict, creating levels as needed
    identifier = assets
    for uri in path[1:-1]:
        try:
            identifier[uri]
        except KeyError:
            identifier[uri] = {}
        identifier = identifier[uri]

    if not key.name.endswith('/'):
        identifier[path[-1]] = key.generate_url(expires_in=0, query_auth=False)

return assets

If you only want the empty folders:

folders = {}
for key in self.bucket.list(str(self.org) + '/'):
    path = key.name.split('/')

    # Walk down the nested dict, creating levels as needed
    identifier = folders
    for uri in path[1:-1]:
        try:
            identifier[uri]
        except KeyError:
            identifier[uri] = {}
        identifier = identifier[uri]

    if key.name.endswith('/'):
        identifier[path[-1]] = {}

return folders

These can then be read out recursively later.
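The nested-dictionary idea can be exercised without S3 by running it over plain key-name strings (a hypothetical key list; real code would iterate bucket.list() and store URLs instead of the key names):

```python
def build_tree(key_names):
    """Build a nested dict mirroring the folder structure in key names."""
    tree = {}
    for name in key_names:
        parts = name.split("/")
        node = tree
        # Descend into each folder component, creating levels as needed
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        if parts[-1]:  # an empty last part means the key was a folder marker
            node[parts[-1]] = name  # real code would store key.generate_url(...) here
    return tree

keys = ["files/pdf/abc.pdf", "files/pdf/new/", "files/pdf/new/abc.pdf"]
print(build_tree(keys))
# {'files': {'pdf': {'abc.pdf': 'files/pdf/abc.pdf', 'new': {'abc.pdf': 'files/pdf/new/abc.pdf'}}}}
```

Folder-marker keys (ending in `/`) contribute only their path components, so empty folders still appear in the tree as empty dicts.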

Answer 7 (score: 1)

I found the following works using boto3:

import boto3

def list_folders(s3_client, bucket_name):
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix='', Delimiter='/')
    for content in response.get('CommonPrefixes', []):
        yield content.get('Prefix')

session = boto3.session.Session()
s3_client = session.client('s3')
bucket_name = 'your-bucket-name'  # your bucket name here
folder_list = list_folders(s3_client, bucket_name)
for folder in folder_list:
    print('Folder found: %s' % folder)
