How to get the list of folders in a given bucket using the Google Cloud API

Asked: 2016-05-06 14:28:57

Tags: python google-cloud-storage google-api-python-client

I want to get all the folders inside a given Google Cloud Storage bucket or folder using the Google Cloud Storage API.

For example, if gs://abc/xyz contains three folders gs://abc/xyz/x1, gs://abc/xyz/x2, and gs://abc/xyz/x3, the API should return all three folders under gs://abc/xyz.

This is easy to do with gsutil:

gsutil ls gs://abc/xyz

But I need to do it in Python with the Google Cloud Storage API.

10 Answers:

Answer 0 (score: 5)

I also needed to simply list the contents of a bucket. Ideally I wanted something similar to what tf.gfile provides; tf.gfile has support for determining whether an entry is a file or a directory.

I tried the various links provided by @jterrace above, but my results were not optimal. That said, the results are worth showing.

Given a bucket that has a mix of "directories" and "files", it is hard to navigate the "filesystem" to find items of interest. I have provided some comments in the code about how the code referenced above works.

In either case, I am using a datalab notebook with the credentials included in the notebook. Given the results, I will need to use string parsing to determine which files are in a particular directory. If anybody knows how to expand these methods, or knows of an alternative method to parse the directories similar to tf.gfile, please reply.

Method One

import sys
import json
import argparse
import googleapiclient.discovery

BUCKET = 'bucket-sounds' 

def create_service():
    return googleapiclient.discovery.build('storage', 'v1')


def list_bucket(bucket):
    """Returns a list of metadata of the objects within the given bucket."""
    service = create_service()

    # Create a request to objects.list to retrieve a list of objects.
    fields_to_return = 'nextPageToken,items(name,size,contentType,metadata(my-key))'
    #req = service.objects().list(bucket=bucket, fields=fields_to_return)  # returns everything
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound')  # returns everything. UrbanSound is top dir in bucket
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/FREE') # returns the file FREESOUNDCREDITS.TXT
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/FREESOUNDCREDITS.txt', delimiter='/') # same as above
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/data/dog_bark', delimiter='/') # returns nothing
    req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/data/dog_bark/', delimiter='/') # returns files in dog_bark dir

    all_objects = []
    # If you have too many items to list in one request, list_next() will
    # automatically handle paging with the pageToken.
    while req:
        resp = req.execute()
        all_objects.extend(resp.get('items', []))
        req = service.objects().list_next(req, resp)
    return all_objects

# usage
print(json.dumps(list_bucket(BUCKET), indent=2))

This produces results like:

[
  {
    "contentType": "text/csv", 
    "name": "UrbanSound/data/dog_bark/100032.csv", 
    "size": "29"
  }, 
  {
    "contentType": "application/json", 
    "name": "UrbanSound/data/dog_bark/100032.json", 
    "size": "1858"
  }
  ... snipped ...
]

Method Two

import re
import sys
from google.cloud import storage

BUCKET = 'bucket-sounds'

# Create a Cloud Storage client.
gcs = storage.Client()

# Get the bucket that the file will be uploaded to.
bucket = gcs.get_bucket(BUCKET)

def my_list_bucket(bucket_name, limit=sys.maxsize):
  a_bucket = gcs.lookup_bucket(bucket_name)
  bucket_iterator = a_bucket.list_blobs()
  for resource in bucket_iterator:
    print(resource.name)
    limit = limit - 1
    if limit <= 0:
      break

my_list_bucket(BUCKET, limit=5)

This generates output like this:

UrbanSound/FREESOUNDCREDITS.txt
UrbanSound/UrbanSound_README.txt
UrbanSound/data/air_conditioner/100852.csv
UrbanSound/data/air_conditioner/100852.json
UrbanSound/data/air_conditioner/100852.mp3

Answer 1 (score: 3)

You can use the Python GCS API Client Library. See the Samples and Libraries for Google Cloud Storage documentation page for relevant links to documentation and downloads.

In your case, first I want to point out that you're confusing the term "bucket". I recommend reading the Key Terms page of the documentation. What you're talking about are object name prefixes.

You can start with the list-objects.py sample on GitHub. Looking at the list reference page, you'll want to pass prefix=abc/xyz and delimiter=/.
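To build intuition for what prefix and delimiter do on the server side, here is a small pure-Python simulation of the grouping GCS performs (a sketch for illustration only, not the actual API; the object names are made up):

```python
def simulate_list(object_names, prefix="", delimiter="/"):
    """Mimic GCS objects.list: names under `prefix` that contain the
    delimiter beyond the prefix are rolled up into `prefixes` ("folders");
    the rest are returned as `items` (plain objects)."""
    items, prefixes = [], set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter in rest:
            # roll everything after the first delimiter into one prefix
            prefixes.add(prefix + rest.split(delimiter)[0] + delimiter)
        else:
            items.append(name)
    return items, sorted(prefixes)

objects = ["xyz/x1/a.txt", "xyz/x2/b.txt", "xyz/x3/c.txt", "xyz/readme.txt"]
items, prefixes = simulate_list(objects, prefix="xyz/")
print(prefixes)  # ['xyz/x1/', 'xyz/x2/', 'xyz/x3/']
print(items)     # ['xyz/readme.txt']
```

This mirrors the question's example: with prefix xyz/ and delimiter /, the three "folders" come back in the prefixes collection rather than as objects.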

Answer 2 (score: 1)

Here is an update to this answer thread:

from google.cloud import storage

# Instantiates a client
storage_client = storage.Client()

# Get GCS bucket
bucket = storage_client.get_bucket(bucket_name)

# Get blobs in bucket (including all subdirectories)
blobs_all = list(bucket.list_blobs())

# Get blobs in a specific subdirectory
blobs_specific = list(bucket.list_blobs(prefix='path/to/subfolder/'))

Answer 3 (score: 1)

I faced the same issue and managed to solve it using the standard list_blobs described here:

from google.cloud import storage

storage_client = storage.Client()

# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(
    bucket_name, prefix=prefix, delimiter=delimiter
)

print("Blobs:")
for blob in blobs:
    print(blob.name)

if delimiter:
    print("Prefixes:")
    for prefix in blobs.prefixes:
        print(prefix)

However, this only started working after I read AntPhitlok's answer and understood that I had to make sure the prefix ends with / and that I am also using / as the delimiter.

Because of that, under the "Blobs:" section we will only get file names (if any exist directly under the prefix folder), not folders. All the sub-directories will be listed in the "Prefixes:" section.

It's important to note that blobs is in fact an iterator, so to get the sub-directories we must "open" it first. Therefore, omitting the "Blobs:" section from our code will result in an empty set() inside blobs.prefixes.

Edit: An example of usage. Say I have a bucket named buck, and inside it a directory named dir. Inside dir I have another directory named subdir.

In order to list the directories inside dir, I can use:

from google.cloud import storage

storage_client = storage.Client()
blobs = storage_client.list_blobs('buck', prefix='dir/', delimiter='/')

print("Blobs:")
for blob in blobs:
    print(blob.name)

if delimiter:
    print("Prefixes:")
    for prefix in blobs.prefixes:
        print(prefix)

*Note the / at the end of the prefix, and that it is also used as the delimiter.

This call will print the following for me:

Prefixes:
subdir/

Answer 4 (score: 1)

Here is a simple solution:

from google.cloud import storage # !pip install --upgrade google-cloud-storage
import os

# set up your bucket 
client = storage.Client.from_service_account_json('XXXXXXXX')
bucket = client.get_bucket('XXXXXXXX')

# get all the folder in folder "base_folder"
base_folder = 'model_testing'
blobs=list(bucket.list_blobs(prefix=base_folder))
folders = list(set([os.path.dirname(k.name) for k in blobs]))
print(*folders, sep = '\n')

And if you want only the folders one level inside the selected folder:

base_folder = base_folder.rstrip(os.sep) # needed to remove any slashes at the end of the string 
one_out = list(set([base_folder+ os.sep.join(k.split(base_folder)[-1].split(os.sep)[:2]) for k in folders]))
print(*one_out, sep = '\n')

And of course, instead of using

list(set())

you can use numpy:

import numpy as np
np.unique()
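The "one level down" step above can also be written directly against the blob names, without numpy; a small self-contained sketch (the file names are hypothetical):

```python
def immediate_subfolders(blob_names, base_folder):
    """Return the folders exactly one level below `base_folder`,
    derived purely from object name strings."""
    base = base_folder.rstrip("/") + "/"
    subs = set()
    for name in blob_names:
        if name.startswith(base):
            rest = name[len(base):]
            # a '/' in the remainder means the object sits in a subfolder
            if "/" in rest:
                subs.add(base + rest.split("/")[0] + "/")
    return sorted(subs)

names = [
    "model_testing/run1/weights.h5",
    "model_testing/run2/logs/log.txt",
    "model_testing/notes.txt",
]
print(immediate_subfolders(names, "model_testing"))
# ['model_testing/run1/', 'model_testing/run2/']
```

Using "/" explicitly (rather than os.sep) keeps this correct on Windows too, since GCS object names always use forward slashes.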

Answer 5 (score: 0)

To get the list of folders in a bucket, you can use the code snippet below:

import googleapiclient.discovery


def list_sub_directories(bucket_name, prefix):
    """Returns a list of sub-directories within the given bucket."""
    service = googleapiclient.discovery.build('storage', 'v1')

    req = service.objects().list(bucket=bucket_name, prefix=prefix, delimiter='/')
    res = req.execute()
    # 'prefixes' is absent from the response when no sub-directories match
    return res.get('prefixes', [])

# For the example (gs://abc/xyz), bucket_name is 'abc' and the prefix would be 'xyz/'
print(list_sub_directories(bucket_name='abc', prefix='xyz/'))

Answer 6 (score: 0)

# sudo pip3 install --upgrade google-cloud-storage
import os
from google.cloud import storage

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./key.json"
storage_client = storage.Client()
bucket = storage_client.get_bucket("my-bucket")
blobs = list(bucket.list_blobs(prefix='dir/'))
print(blobs)

Answer 7 (score: 0)

This question is about listing the folders inside a bucket/folder. None of the suggestions worked for me, and after experimenting with the google.cloud.storage SDK, I suspect it is not possible (as of November 2019) to list the sub-directories of any path in a bucket. It is possible with the REST API, so I wrote this little wrapper...

from google.api_core import page_iterator
from google.cloud import storage

def _item_to_value(iterator, item):
    return item

def list_directories(bucket_name, path):
    if not path.endswith('/'):
        path += '/'

    extra_params = {
        "projection": "noAcl",
        "prefix": path,
        "delimiter": '/'
    }

    gcs = storage.Client()

    path = "/b/" + bucket_name + "/o"

    iterator = page_iterator.HTTPIterator(
        client=gcs,
        api_request=gcs._connection.api_request,
        path=path,
        items_key='prefixes',
        item_to_value=_item_to_value,
        extra_params=extra_params,
    )

    return [x for x in iterator]

For example, if your my-bucket contained:

  • dog-bark
    • datasets
      • v1
      • v2

Then calling list_directories('my-bucket', 'dog-bark/datasets') will return:

['dog-bark/datasets/v1', 'dog-bark/datasets/v2']

Answer 8 (score: 0)

#python notebook
ret_folders = !gsutil ls $path_possible_with_regex | grep -e "/$"
ret_folders_no_subdir = [x for x in ret_folders if x.split("/")[-2] != "SUBDIR"]

You can edit the condition to fit your needs. In my case, I only wanted the deeper "folders". For same-level folders, you can replace the condition with:

 x.split("/")[-2] == "SUBDIR"
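The same filtering can be done in plain Python once the gsutil ls output has been captured. A sketch, assuming ret_folders holds listing lines shaped like the ones gsutil prints (the bucket and folder names below are made up):

```python
# hypothetical captured output of: gsutil ls gs://my-bucket/data/
ret_folders = [
    "gs://my-bucket/data/dog_bark/",
    "gs://my-bucket/data/SUBDIR/",
    "gs://my-bucket/data/air_conditioner/",
]

# keep only "folder" entries (trailing slash) whose last path
# component is not the unwanted SUBDIR
ret_folders_no_subdir = [
    x for x in ret_folders
    if x.endswith("/") and x.rstrip("/").split("/")[-1] != "SUBDIR"
]
print(ret_folders_no_subdir)
# ['gs://my-bucket/data/dog_bark/', 'gs://my-bucket/data/air_conditioner/']
```

Stripping the trailing slash before splitting makes the comparison use the folder's own name, which is equivalent to the `x.split("/")[-2]` trick above but reads more explicitly.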

Answer 9 (score: 0)

Here is a simple way to get all the subfolders:

from google.cloud import storage


def get_subdirs(bucket_name, dir_name=None):
    """
    List all subdirectories for a bucket or
    a specific folder in a bucket. If `dir_name`
    is left blank, it will list all directories in the bucket.
    """
    client = storage.Client()
    bucket = client.lookup_bucket(bucket_name)

    all_folders = []
    for resource in bucket.list_blobs(prefix=dir_name):

        # filter for directories only
        n = resource.name
        if n.endswith("/"):
            all_folders.append(n)

    return all_folders

# Use as follows:
all_folders = get_subdirs("my-bucket")