I want to use the Google Cloud Storage API to get all of the folders inside a given Google Cloud Storage bucket or folder. For example, if gs://abc/xyz contains three folders, gs://abc/xyz/x1, gs://abc/xyz/x2, and gs://abc/xyz/x3, the API should return all three folders under gs://abc/xyz.

This is easy with gsutil:

gsutil ls gs://abc/xyz

but I need to do it using Python and the Google Cloud Storage API.
Answer 0: (score: 5)
I also needed to simply list the contents of a bucket. Ideally I wanted something similar to what tf.gfile provides; tf.gfile can determine whether an entry is a file or a directory.

I tried the various links provided by @jterrace above, but my results were not optimal. That said, it is worth showing the results.

Given a bucket that mixes "directories" and "files", it is hard to navigate the "filesystem" to find items of interest. I have added some comments in the code about how the code referenced above works.

In either case, I am using a Datalab notebook, which carries the credentials. Given the results, I will need to use string parsing to determine the files in a specific directory. If anyone knows how to extend these methods, or an alternative method to parse the directories similarly to tf.gfile, please reply.
import sys
import json
import argparse
import googleapiclient.discovery

BUCKET = 'bucket-sounds'

def create_service():
    return googleapiclient.discovery.build('storage', 'v1')

def list_bucket(bucket):
    """Returns a list of metadata of the objects within the given bucket."""
    service = create_service()

    # Create a request to objects.list to retrieve a list of objects.
    fields_to_return = 'nextPageToken,items(name,size,contentType,metadata(my-key))'

    #req = service.objects().list(bucket=bucket, fields=fields_to_return)  # returns everything
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound')  # returns everything. UrbanSound is top dir in bucket
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/FREE')  # returns the file FREESOUNDCREDITS.TXT
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/FREESOUNDCREDITS.txt', delimiter='/')  # same as above
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/data/dog_bark', delimiter='/')  # returns nothing
    req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/data/dog_bark/', delimiter='/')  # returns files in dog_bark dir

    all_objects = []
    # If you have too many items to list in one request, list_next() will
    # automatically handle paging with the pageToken.
    while req:
        resp = req.execute()
        all_objects.extend(resp.get('items', []))
        req = service.objects().list_next(req, resp)
    return all_objects

# usage
print(json.dumps(list_bucket(BUCKET), indent=2))
This produces results like:
[
  {
    "contentType": "text/csv",
    "name": "UrbanSound/data/dog_bark/100032.csv",
    "size": "29"
  },
  {
    "contentType": "application/json",
    "name": "UrbanSound/data/dog_bark/100032.json",
    "size": "1858"
  }
  ... stuff snipped ...
]
import re
import sys
from google.cloud import storage

BUCKET = 'bucket-sounds'

# Create a Cloud Storage client.
gcs = storage.Client()

# Get the bucket that the file will be uploaded to.
bucket = gcs.get_bucket(BUCKET)

def my_list_bucket(bucket_name, limit=sys.maxsize):
    a_bucket = gcs.lookup_bucket(bucket_name)
    bucket_iterator = a_bucket.list_blobs()
    for resource in bucket_iterator:
        print(resource.name)
        limit = limit - 1
        if limit <= 0:
            break

my_list_bucket(BUCKET, limit=5)
This generates output like this:
UrbanSound/FREESOUNDCREDITS.txt
UrbanSound/UrbanSound_README.txt
UrbanSound/data/air_conditioner/100852.csv
UrbanSound/data/air_conditioner/100852.json
UrbanSound/data/air_conditioner/100852.mp3
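To follow up on the string-parsing question above: given the flat object names that either listing returns, the immediate sub-directories under a prefix can be recovered with plain string operations. This is my own sketch, not part of the original answer, and `subdirs_under` is a hypothetical helper name:

```python
def subdirs_under(object_names, prefix):
    """Return the immediate 'sub-directories' of `prefix`, derived
    purely from flat object names (no extra API calls needed)."""
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    dirs = set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        remainder = name[len(prefix):]
        # an object one or more levels down contributes its first path segment
        if "/" in remainder:
            dirs.add(prefix + remainder.split("/", 1)[0] + "/")
    return sorted(dirs)

names = [
    "UrbanSound/FREESOUNDCREDITS.txt",
    "UrbanSound/data/air_conditioner/100852.csv",
    "UrbanSound/data/dog_bark/100032.csv",
    "UrbanSound/data/dog_bark/100032.json",
]
print(subdirs_under(names, "UrbanSound/data"))
# ['UrbanSound/data/air_conditioner/', 'UrbanSound/data/dog_bark/']
```

Feed it the `name` fields from `list_bucket` (or `resource.name` from `my_list_bucket`) to navigate one level at a time, similar to what tf.gfile offers.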
Answer 1: (score: 3)
You can use the Python GCS API client library. See the Samples and Libraries for Google Cloud Storage documentation page for relevant links to documentation and downloads.

In your case, first I want to point out that you're confusing the term "bucket". I recommend reading the Key Terms page of the documentation. What you're talking about are object name prefixes.

You can start with the list-objects.py sample on GitHub. Looking at the list reference page, you'll want to pass prefix=abc/xyz and delimiter=/.
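To make the prefix/delimiter semantics concrete, here is a rough pure-Python simulation of how the service rolls object names up into prefixes (an illustration of the semantics only, not the actual API):

```python
def simulate_list(object_names, prefix="", delimiter=""):
    """Roughly mimic objects.list: names whose remainder (after the
    prefix) contains the delimiter collapse into a 'prefixes' entry;
    the rest are returned as plain items."""
    items, prefixes = [], set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        remainder = name[len(prefix):]
        if delimiter and delimiter in remainder:
            prefixes.add(prefix + remainder.split(delimiter, 1)[0] + delimiter)
        else:
            items.append(name)
    return items, sorted(prefixes)

names = ["xyz/top.txt", "xyz/x1/a.txt", "xyz/x2/b.txt", "xyz/x3/c.txt"]
items, prefixes = simulate_list(names, prefix="xyz/", delimiter="/")
print(items)     # ['xyz/top.txt']
print(prefixes)  # ['xyz/x1/', 'xyz/x2/', 'xyz/x3/']
```

This is why the prefixes returned alongside the listing are exactly the "folders" the question asks for.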
Answer 2: (score: 1)
Here's an update to this answer thread:
from google.cloud import storage
# Instantiates a client
storage_client = storage.Client()
# Get GCS bucket
bucket = storage_client.get_bucket(bucket_name)
# Get blobs in bucket (including all subdirectories)
blobs_all = list(bucket.list_blobs())
# Get blobs in a specific subdirectory
blobs_specific = list(bucket.list_blobs(prefix='path/to/subfolder/'))
Answer 3: (score: 1)
I faced the same issue and managed to solve it using the standard list_blobs described here:
from google.cloud import storage

storage_client = storage.Client()

# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(
    bucket_name, prefix=prefix, delimiter=delimiter
)

print("Blobs:")
for blob in blobs:
    print(blob.name)

if delimiter:
    print("Prefixes:")
    for prefix in blobs.prefixes:
        print(prefix)
However, this only worked for me after I read the AntPhitlok answer and understood that I have to make sure the prefix ends with / and that I am also using / as the delimiter.

Because of this, under the "Blobs:" section we will only get file names, not folders, if any files exist directly under the prefix folder. All the sub-directories will be listed in the "Prefixes:" section.

It's important to note that blobs is in fact an iterator, so to get the sub-directories we must first "open" it. Therefore, skipping the "Blobs:" section in our code will leave blobs.prefixes as an empty set().
Edit:

An example of usage: say I have a bucket named buck, and inside it a directory named dir. Inside dir there is another directory named subdir.

In order to list the directories inside dir, I can use:
from google.cloud import storage

storage_client = storage.Client()
blobs = storage_client.list_blobs('buck', prefix='dir/', delimiter='/')

print("Blobs:")
for blob in blobs:
    print(blob.name)

# the delimiter is a literal '/' here, so we can print prefixes unconditionally
print("Prefixes:")
for prefix in blobs.prefixes:
    print(prefix)
*Note that / appears at the end of the prefix and is used as the delimiter.
This call will print the following for me:
Prefixes:
subdir/
Answer 4: (score: 1)
Here is a simple solution:
from google.cloud import storage  # !pip install --upgrade google-cloud-storage
import os

# set up your bucket
client = storage.Client()
storage_client = storage.Client.from_service_account_json('XXXXXXXX')
bucket = client.get_bucket('XXXXXXXX')

# get all the folders in folder "base_folder"
base_folder = 'model_testing'
blobs = list(bucket.list_blobs(prefix=base_folder))
folders = list(set([os.path.dirname(k.name) for k in blobs]))
print(*folders, sep='\n')
If you only want the folders directly inside the selected folder:
base_folder = base_folder.rstrip(os.sep)  # needed to remove any slashes at the end of the string
one_out = list(set([base_folder + os.sep.join(k.split(base_folder)[-1].split(os.sep)[:2]) for k in folders]))
print(*one_out, sep='\n')
Of course, instead of list(set()) you could use numpy's np.unique() (import numpy as np).
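For instance (assuming numpy is installed, and using made-up blob names for illustration), the deduplication step could look like:

```python
import os
import numpy as np

# hypothetical blob names, standing in for [k.name for k in blobs] above
blob_names = [
    "model_testing/run1/weights.h5",
    "model_testing/run1/log.txt",
    "model_testing/run2/weights.h5",
]
# np.unique sorts and deduplicates in one call, unlike list(set(...))
folders = np.unique([os.path.dirname(n) for n in blob_names]).tolist()
print(folders)  # ['model_testing/run1', 'model_testing/run2']
```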
Answer 5: (score: 0)
To get the list of folders in a bucket, you can use the code snippet below:
import googleapiclient.discovery

def list_sub_directories(bucket_name, prefix):
    """Returns a list of sub-directories within the given bucket."""
    service = googleapiclient.discovery.build('storage', 'v1')
    req = service.objects().list(bucket=bucket_name, prefix=prefix, delimiter='/')
    res = req.execute()
    return res['prefixes']

# For the example (gs://abc/xyz), bucket_name is 'abc' and the prefix would be 'xyz/'
print(list_sub_directories(bucket_name='abc', prefix='xyz/'))
Answer 6: (score: 0)
# sudo pip3 install --upgrade google-cloud-storage
import os
from google.cloud import storage

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./key.json"
storage_client = storage.Client()
bucket = storage_client.get_bucket("my-bucket")
blobs = list(bucket.list_blobs(prefix='dir/'))
print(blobs)
Answer 7: (score: 0)
This question is about listing the folders inside a bucket/folder. None of the suggestions worked for me, and after experimenting with the google.cloud.storage SDK, I suspect it is not possible (as of November 2019) to list the sub-directories of any path in a bucket. It is possible with the REST API, so I wrote this little wrapper...
from google.api_core import page_iterator
from google.cloud import storage

def _item_to_value(iterator, item):
    return item

def list_directories(bucket_name, path):
    if not path.endswith('/'):
        path += '/'

    extra_params = {
        "projection": "noAcl",
        "prefix": path,
        "delimiter": '/'
    }

    gcs = storage.Client()

    path = "/b/" + bucket_name + "/o"

    iterator = page_iterator.HTTPIterator(
        client=gcs,
        api_request=gcs._connection.api_request,
        path=path,
        items_key='prefixes',
        item_to_value=_item_to_value,
        extra_params=extra_params,
    )

    return [x for x in iterator]
For example, if your my-bucket contains objects under dog-bark/datasets/v1 and dog-bark/datasets/v2, then calling list_directories('my-bucket', 'dog-bark/datasets') will return:
['dog-bark/datasets/v1', 'dog-bark/datasets/v2']
Answer 8: (score: 0)
# python notebook
ret_folders = !gsutil ls $path_possible_with_regex | grep -e "/$"
ret_folders_no_subdir = [x for x in ret_folders if x.split("/")[-2] != "SUBDIR"]

You can edit the condition to whatever works for you. In my case, I only needed the deeper "folders". To keep the folders at that level instead, you can replace the condition with

x.split("/")[-2] == "SUBDIR"
Answer 9: (score: 0)
Here's a simple way to get all the subfolders:
from google.cloud import storage

def get_subdirs(bucket_name, dir_name=None):
    """
    List all subdirectories for a bucket or
    a specific folder in a bucket. If `dir_name`
    is left blank, it will list all directories in the bucket.
    """
    client = storage.Client()
    bucket = client.lookup_bucket(bucket_name)

    all_folders = []
    for resource in bucket.list_blobs(prefix=dir_name):
        # filter for directories only
        n = resource.name
        if n.endswith("/"):
            all_folders.append(n)

    return all_folders

# Use as follows:
all_folders = get_subdirs("my-bucket")