How to get the list of folders in a given bucket using the Google Cloud API

Asked: 2016-05-06 14:28:57

Tags: python google-cloud-storage google-api-python-client

I want to get all the folders inside a given Google Cloud Storage bucket or folder using the Google Cloud Storage API.

For example, if gs://abc/xyz contains three folders gs://abc/xyz/x1, gs://abc/xyz/x2, and gs://abc/xyz/x3, the API should return all three folders under gs://abc/xyz.

This is easy to do with gsutil:

gsutil ls gs://abc/xyz

But I need to do it in Python with the Google Cloud Storage API.

10 Answers:

Answer 0 (score: 5)

I also needed to simply list the contents of a bucket. Ideally I wanted something similar to what tf.gfile provides; tf.gfile has support for determining whether an entry is a file or a directory.

I tried the various links provided by @jterrace above, but my results were not optimal. That said, the results are worth showing.

Given a bucket that has a mix of "directories" and "files", it is hard to navigate the "filesystem" to find items of interest. I have provided some comments in the code about how the code referenced above works.

In either case, I am using a datalab notebook with the credentials included in the notebook. Given the results, I will need to use string parsing to determine which files are in a particular directory. If anybody knows how to expand these methods, or knows of an alternative method to parse the directories similar to tf.gfile, please reply.

Method One

import sys
import json
import argparse
import googleapiclient.discovery

BUCKET = 'bucket-sounds' 

def create_service():
    return googleapiclient.discovery.build('storage', 'v1')


def list_bucket(bucket):
    """Returns a list of metadata of the objects within the given bucket."""
    service = create_service()

    # Create a request to objects.list to retrieve a list of objects.
    fields_to_return = 'nextPageToken,items(name,size,contentType,metadata(my-key))'
    #req = service.objects().list(bucket=bucket, fields=fields_to_return)  # returns everything
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound')  # returns everything. UrbanSound is top dir in bucket
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/FREE') # returns the file FREESOUNDCREDITS.TXT
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/FREESOUNDCREDITS.txt', delimiter='/') # same as above
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/data/dog_bark', delimiter='/') # returns nothing
    req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/data/dog_bark/', delimiter='/') # returns files in dog_bark dir

    all_objects = []
    # If you have too many items to list in one request, list_next() will
    # automatically handle paging with the pageToken.
    while req:
        resp = req.execute()
        all_objects.extend(resp.get('items', []))
        req = service.objects().list_next(req, resp)
    return all_objects

# usage
print(json.dumps(list_bucket(BUCKET), indent=2))

This produces results like:

[
  {
    "contentType": "text/csv", 
    "name": "UrbanSound/data/dog_bark/100032.csv", 
    "size": "29"
  }, 
  {
    "contentType": "application/json", 
    "name": "UrbanSound/data/dog_bark/100032.json", 
    "size": "1858"
  }
  ... snipped ...
]

Method Two

import re
import sys
from google.cloud import storage

BUCKET = 'bucket-sounds'

# Create a Cloud Storage client.
gcs = storage.Client()

# Get the bucket that the file will be uploaded to.
bucket = gcs.get_bucket(BUCKET)

def my_list_bucket(bucket_name, limit=sys.maxsize):
  a_bucket = gcs.lookup_bucket(bucket_name)
  bucket_iterator = a_bucket.list_blobs()
  for resource in bucket_iterator:
    print(resource.name)
    limit = limit - 1
    if limit <= 0:
      break

my_list_bucket(BUCKET, limit=5)

This generates output like this:

UrbanSound/FREESOUNDCREDITS.txt
UrbanSound/UrbanSound_README.txt
UrbanSound/data/air_conditioner/100852.csv
UrbanSound/data/air_conditioner/100852.json
UrbanSound/data/air_conditioner/100852.mp3

Answer 1 (score: 3)

You can use the Python GCS API Client Library. See the Samples and Libraries for Google Cloud Storage documentation page for relevant links to documentation and downloads.

In your case, first I want to point out that you're confusing the term "bucket". I recommend reading the Key Terms page of the documentation. What you're talking about are object name prefixes.

You can start with the list-objects.py sample on GitHub. Looking at the list reference page, you'll want to pass prefix=abc/xyz and delimiter=/.
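To build intuition for what prefix and delimiter do on the server side, here is a small pure-Python simulation of the grouping GCS performs (a sketch for illustration only, not the actual API; the object names are made up):

```python
def simulate_list(object_names, prefix="", delimiter="/"):
    """Mimic GCS objects.list: names under `prefix` that contain the
    delimiter beyond the prefix are rolled up into `prefixes` ("folders");
    the rest are returned as `items` (plain objects)."""
    items, prefixes = [], set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter in rest:
            # roll everything after the first delimiter into one prefix
            prefixes.add(prefix + rest.split(delimiter)[0] + delimiter)
        else:
            items.append(name)
    return items, sorted(prefixes)

objects = ["xyz/x1/a.txt", "xyz/x2/b.txt", "xyz/x3/c.txt", "xyz/readme.txt"]
items, prefixes = simulate_list(objects, prefix="xyz/")
print(prefixes)  # ['xyz/x1/', 'xyz/x2/', 'xyz/x3/']
print(items)     # ['xyz/readme.txt']
```

This mirrors the question's example: with prefix xyz/ and delimiter /, the three "folders" come back in the prefixes collection rather than as objects.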

Answer 2 (score: 1)

Here is an update to this answer thread:

from google.cloud import storage

# Instantiates a client
storage_client = storage.Client()

# Get GCS bucket
bucket = storage_client.get_bucket(bucket_name)

# Get blobs in bucket (including all subdirectories)
blobs_all = list(bucket.list_blobs())

# Get blobs in a specific subdirectory
blobs_specific = list(bucket.list_blobs(prefix='path/to/subfolder/'))

Answer 3 (score: 1)

I faced the same issue and managed to solve it using the standard list_blobs described here:

from google.cloud import storage

storage_client = storage.Client()

# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(
    bucket_name, prefix=prefix, delimiter=delimiter
)

print("Blobs:")
for blob in blobs:
    print(blob.name)

if delimiter:
    print("Prefixes:")
    for prefix in blobs.prefixes:
        print(prefix)

However, this only started working after I read AntPhitlok's answer and understood that I had to make sure the prefix ends with / and that I am also using / as the delimiter.

Because of that, under the "Blobs:" section we will only get file names (if any exist directly under the prefix folder), not folders. All the sub-directories will be listed in the "Prefixes:" section.

It's important to note that blobs is in fact an iterator, so to get the sub-directories we must "open" it first. Therefore, omitting the "Blobs:" section from our code will result in an empty set() inside blobs.prefixes.

Edit: An example of usage. Say I have a bucket named buck, and inside it a directory named dir. Inside dir I have another directory named subdir.

In order to list the directories inside dir, I can use:

from google.cloud import storage

storage_client = storage.Client()
blobs = storage_client.list_blobs('buck', prefix='dir/', delimiter='/')

print("Blobs:")
for blob in blobs:
    print(blob.name)

if delimiter:
    print("Prefixes:")
    for prefix in blobs.prefixes:
        print(prefix)

*Note the / at the end of the prefix, and that it is also used as the delimiter.

This call will print the following for me:

Prefixes:
subdir/

Answer 4 (score: 1)

Here is a simple solution:

from google.cloud import storage # !pip install --upgrade google-cloud-storage
import os

# set up your bucket 
client = storage.Client.from_service_account_json('XXXXXXXX')
bucket = client.get_bucket('XXXXXXXX')

# get all the folder in folder "base_folder"
base_folder = 'model_testing'
blobs=list(bucket.list_blobs(prefix=base_folder))
folders = list(set([os.path.dirname(k.name) for k in blobs]))
print(*folders, sep = '\n')

And if you want only the folders one level inside the selected folder:

base_folder = base_folder.rstrip(os.sep) # needed to remove any slashes at the end of the string 
one_out = list(set([base_folder+ os.sep.join(k.split(base_folder)[-1].split(os.sep)[:2]) for k in folders]))
print(*one_out, sep = '\n')

And of course, instead of using

list(set())

you can use numpy:

import numpy as np
np.unique()
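The "one level down" step above can also be written directly against the blob names, without numpy; a small self-contained sketch (the file names are hypothetical):

```python
def immediate_subfolders(blob_names, base_folder):
    """Return the folders exactly one level below `base_folder`,
    derived purely from object name strings."""
    base = base_folder.rstrip("/") + "/"
    subs = set()
    for name in blob_names:
        if name.startswith(base):
            rest = name[len(base):]
            # a '/' in the remainder means the object sits in a subfolder
            if "/" in rest:
                subs.add(base + rest.split("/")[0] + "/")
    return sorted(subs)

names = [
    "model_testing/run1/weights.h5",
    "model_testing/run2/logs/log.txt",
    "model_testing/notes.txt",
]
print(immediate_subfolders(names, "model_testing"))
# ['model_testing/run1/', 'model_testing/run2/']
```

Using "/" explicitly (rather than os.sep) keeps this correct on Windows too, since GCS object names always use forward slashes.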

Answer 5 (score: 0)

To get the list of folders in a bucket, you can use the code snippet below:

import googleapiclient.discovery


def list_sub_directories(bucket_name, prefix):
    """Returns a list of sub-directories within the given bucket."""
    service = googleapiclient.discovery.build('storage', 'v1')

    req = service.objects().list(bucket=bucket_name, prefix=prefix, delimiter='/')
    res = req.execute()
    # 'prefixes' is absent from the response when no sub-directories match
    return res.get('prefixes', [])

# For the example (gs://abc/xyz), bucket_name is 'abc' and the prefix would be 'xyz/'
print(list_sub_directories(bucket_name='abc', prefix='xyz/'))

Answer 6 (score: 0)

# sudo pip3 install --upgrade google-cloud-storage
import os
from google.cloud import storage

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./key.json"
storage_client = storage.Client()
bucket = storage_client.get_bucket("my-bucket")
blobs = list(bucket.list_blobs(prefix='dir/'))
print(blobs)

Answer 7 (score: 0)

This question is about listing the folders inside a bucket/folder. None of the suggestions worked for me, and after experimenting with the google.cloud.storage SDK, I suspect it is not possible (as of November 2019) to list the sub-directories of any path in a bucket. It is possible with the REST API, so I wrote this little wrapper...

from google.api_core import page_iterator
from google.cloud import storage

def _item_to_value(iterator, item):
    return item

def list_directories(bucket_name, path):
    if not path.endswith('/'):
        path += '/'

    extra_params = {
        "projection": "noAcl",
        "prefix": path,
        "delimiter": '/'
    }

    gcs = storage.Client()

    path = "/b/" + bucket_name + "/o"

    iterator = page_iterator.HTTPIterator(
        client=gcs,
        api_request=gcs._connection.api_request,
        path=path,
        items_key='prefixes',
        item_to_value=_item_to_value,
        extra_params=extra_params,
    )

    return [x for x in iterator]

For example, if your my-bucket contained:

  • dog-bark
    • datasets
      • v1
      • v2

Then calling list_directories('my-bucket', 'dog-bark/datasets') will return:

['dog-bark/datasets/v1', 'dog-bark/datasets/v2']

Answer 8 (score: 0)

#python notebook
ret_folders = !gsutil ls $path_possible_with_regex | grep -e "/$"
ret_folders_no_subdir = [x for x in ret_folders if x.split("/")[-2] != "SUBDIR"]

You can edit the condition to fit your needs. In my case, I only wanted the deeper "folders". For same-level folders, you can replace the condition with:

 x.split("/")[-2] == "SUBDIR"
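The same filtering can be done in plain Python once the gsutil ls output has been captured. A sketch, assuming ret_folders holds listing lines shaped like the ones gsutil prints (the bucket and folder names below are made up):

```python
# hypothetical captured output of: gsutil ls gs://my-bucket/data/
ret_folders = [
    "gs://my-bucket/data/dog_bark/",
    "gs://my-bucket/data/SUBDIR/",
    "gs://my-bucket/data/air_conditioner/",
]

# keep only "folder" entries (trailing slash) whose last path
# component is not the unwanted SUBDIR
ret_folders_no_subdir = [
    x for x in ret_folders
    if x.endswith("/") and x.rstrip("/").split("/")[-1] != "SUBDIR"
]
print(ret_folders_no_subdir)
# ['gs://my-bucket/data/dog_bark/', 'gs://my-bucket/data/air_conditioner/']
```

Stripping the trailing slash before splitting makes the comparison use the folder's own name, which is equivalent to the `x.split("/")[-2]` trick above but reads more explicitly.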

Answer 9 (score: 0)

Here is a simple way to get all the subfolders:

from google.cloud import storage


def get_subdirs(bucket_name, dir_name=None):
    """
    List all subdirectories for a bucket or
    a specific folder in a bucket. If `dir_name`
    is left blank, it will list all directories in the bucket.
    """
    client = storage.Client()
    bucket = client.lookup_bucket(bucket_name)

    all_folders = []
    for resource in bucket.list_blobs(prefix=dir_name):

        # filter for directories only
        n = resource.name
        if n.endswith("/"):
            all_folders.append(n)

    return all_folders

# Use as follows:
all_folders = get_subdirs("my-bucket")