Question

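I'm building an application that needs to list a very large number of objects stored in AWS S3 (say 500 million to 1 billion objects). Listing them sequentially with pagination would take weeks. I'd like to parallelize the listing, but to do that effectively I need to map out what is essentially an unknown key space.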

For more context, AWS lets you supply a prefix and a delimiter as part of the ListObjects operation. See here: http://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html
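To make the prefix/delimiter behaviour concrete, here is a minimal in-memory sketch of how a delimited listing rolls keys up into common prefixes. It does not call S3: `list_level` and the sample keys are made up for illustration, and pagination is ignored.

```python
from typing import Iterable, List, Tuple


def list_level(keys: Iterable[str], prefix: str = "",
               delimiter: str = "/") -> Tuple[List[str], List[str]]:
    """Model one delimited listing: return (keys at this level, common prefixes).

    A key that contains the delimiter after the prefix is not returned itself;
    it is rolled up into a common prefix ending at that first delimiter.
    """
    contents: List[str] = []
    common = set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        cut = rest.find(delimiter) if delimiter else -1
        if cut == -1:
            contents.append(key)  # no delimiter left: a plain key at this level
        else:
            common.add(prefix + rest[:cut + len(delimiter)])
    return contents, sorted(common)


if __name__ == "__main__":
    sample = ["photos/2006/January/a.jpg", "photos/2006/February/b.jpg", "sample.jpg"]
    print(list_level(sample))                  # top level
    print(list_level(sample, "photos/2006/"))  # one level down
```

Each common prefix returned at one level is a candidate prefix for the next listing, which is what makes the hierarchy walkable.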

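So it looks like I need an algorithm that can enumerate an unknown S3 key space (prefix space?) and group the discovered prefixes into roughly [X] equal-sized buckets that can be listed in parallel, for even sampling and faster listing.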

High-level pseudocode:

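Run parallel searches using every N-character combination of [0-9a-z] as the prefix (say N = 3), with max-keys set to 1k per request.
For any search that comes back with more than max-keys (we don't know the exact size), issue follow-up GETs with the prefix [discovered prefix] + [0-9a-z]. If a search returns 1-999 keys, add its prefix to a separate list of buckets.
Once we're comfortable that the keys are split up as evenly as we can get them (perhaps after N recursion steps), hand the buckets to worker tasks, which each list their own bucket.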

Challenges:

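Building an evenly sampled distribution of listing work to send to the workers.
Minimizing the chance that one worker ends up doing all the work.
What if all the prefixes start with "aaaaaaaaaaaaaa"? =)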
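One possible way to approach the balancing part, assuming per-prefix key counts (or estimates) are already available: a greedy "largest first" assignment that always gives the next-biggest prefix to the least-loaded worker. This is a generic load-balancing heuristic, not something S3 provides; `assign_prefixes` and the counts below are hypothetical.

```python
import heapq
from typing import Dict, List


def assign_prefixes(counts: Dict[str, int], workers: int) -> List[List[str]]:
    """Greedily spread prefixes over workers so that total key counts stay close."""
    heap = [(0, i) for i in range(workers)]  # (current load, worker index)
    heapq.heapify(heap)
    groups: List[List[str]] = [[] for _ in range(workers)]
    # Largest prefixes first; ties broken by name so the result is deterministic.
    for prefix, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        load, i = heapq.heappop(heap)  # least-loaded worker so far
        groups[i].append(prefix)
        heapq.heappush(heap, (load + n, i))
    return groups


if __name__ == "__main__":
    estimated = {"ab000": 1000, "ab001": 1000, "ab002": 500, "z": 2}
    print(assign_prefixes(estimated, workers=2))
```

The result is only as even as the largest single prefix allows, so a prefix that dwarfs the others would still need to be split further before assignment.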

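The goal is to programmatically discover and group prefixes so that listing can be evenly sampled and parallelized for an unknown prefix structure. Any ideas on algorithms, links, or examples would be greatly appreciated!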

Answer 1

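I wrote a tool that recursively analyzes the S3 key space using a producer/consumer system, so that each newly discovered prefix is enumerated in a separate thread. This is the most efficient approach I could find.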

It would be nice if there were a way to retrieve objects by ETag, but that would only help if ETags were unique within a bucket, and they are not.

The code is below:

def search_objects(bucket, prefix=None, *, name, delimiter='/', limit=None, searchFoundPrefixes=True, threads=20):
    """Search for occurrences of a name. Returns a list of all found keys as dictionaries.
    @param bucket - the bucket to search
    @param prefix - the prefix to start with
    @param name   - the name being searched for (the last delimiter-separated component of a key)
    @param delimiter - the delimiter that separates names
    @param limit  - the maximum number of keys to return
    @param searchFoundPrefixes - If False, do not search the prefixes below a level where name is found.
    @param threads - the number of Python threads to use. Note that this is all in the same process.
    """

    import queue
    import sys
    import threading

    if limit is None:
        limit = sys.maxsize  # should be big enough
    ret = []

    def worker():
        while True:
            prefix = q.get()
            if prefix is None:
                break
            found_prefixes = []
            found_names = 0
            for obj in list_objects(bucket, prefix=prefix, delimiter=delimiter):
                if _Prefix in obj:
                    found_prefixes.append(obj[_Prefix])
                if (_Key in obj) and obj[_Key].split(delimiter)[-1] == name:
                    found_names += 1  # remember that this level contained a match
                    if len(ret) < limit:
                        ret.append(obj)
                if len(ret) >= limit:
                    break
            if found_names == 0 or searchFoundPrefixes:
                if len(ret) < limit:
                    for lp in found_prefixes:
                        q.put(lp)
            q.task_done()

    q = queue.Queue()
    thread_pool = []
    for i in range(threads):
        t = threading.Thread(target=worker)
        t.start()
        thread_pool.append(t)
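    # None is reserved as the worker stop signal, so start from '' when no prefix is given
    q.put(prefix if prefix is not None else '')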

    # block until all tasks are done
    q.join()

    # stop workers
    for i in range(threads):
        q.put(None)
    for t in thread_pool:
        t.join()
    return ret

How to efficiently enumerate an AWS S3 key space