Generating a CSV import file for AutoML Vision from an existing bucket

Time: 2019-10-09 16:21:31

Tags: google-cloud-platform google-cloud-storage automl

I already have a GCloud storage bucket organized by label, like this:

gs://my_bucket/dataset/label1/
gs://my_bucket/dataset/label2/
...

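Each label folder contains photos. I want to generate the CSV file that AutoML Vision requires for import, but since every folder holds hundreds of photos, I don't know how to do it programmatically. The CSV file should look like this: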

gs://my_bucket/dataset/label1/photo1.jpg,label1
gs://my_bucket/dataset/label1/photo12.jpg,label1
gs://my_bucket/dataset/label2/photo7.jpg,label2
...

2 answers:

Answer 0 (score: 0)

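You need to list every file inside the dataset folder with its full path, then parse each path to get the name of the folder that contains the file, which in this case is the label you want to use. This can be done in several ways. Here are two examples you can base your code on.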
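The parsing step on its own can be sketched in a few lines of Python. This is a minimal illustration, not part of either example below; the helper names `label_from_uri` and `csv_row` are hypothetical, and it assumes every photo sits directly inside its label folder.

```python
def label_from_uri(uri: str) -> str:
    """Return the name of the folder that directly contains the object."""
    # "gs://my_bucket/dataset/label1/photo1.jpg" splits into
    # ["gs:", "", "my_bucket", "dataset", "label1", "photo1.jpg"],
    # so the label is the second item from the end.
    return uri.split("/")[-2]


def csv_row(uri: str) -> str:
    """Build one "path,label" line, with no space after the comma."""
    return f"{uri},{label_from_uri(uri)}"
```

For example, `csv_row("gs://my_bucket/dataset/label1/photo1.jpg")` yields the first line of the CSV shown in the question.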

gsutil has a command that lists bucket contents (gsutil ls); you can then parse its output with a bash script:

# Create an empty CSV file (truncating any earlier copy) and define the bucket path
bucket_path="gs://buckbuckbuckbuck/dataset/"
filename="labels_csv_bash.csv"
: > "$filename"

old_ifs=$IFS
IFS=$'\n' # Split the gsutil output on newlines only

# List every .jpg file inside the bucket folder. ** searches recursively.
for i in $(gsutil ls "${bucket_path}**.jpg")
do
        # Split the path on / and take the second field from the end (the label folder).
        label=$(echo "$i" | rev | cut -d'/' -f2 | rev)
        # No space after the comma, matching the format in the question.
        echo "$i,$label" >> "$filename"
done

IFS=$old_ifs # Restore the original value

gsutil cp "$filename" "$bucket_path"

It can also be done with the Google Cloud Client libraries, which are available for several languages. Here is an example in Python:

# Imports the Google Cloud client library
from google.cloud import storage

# Instantiates a client
storage_client = storage.Client()

# Name of the existing bucket and the folder that holds the label folders
bucket_name = 'my_bucket'
path_in_bucket = 'dataset'

# Trailing slash so that sibling prefixes such as 'dataset2' are not matched
blobs = storage_client.list_blobs(bucket_name, prefix=path_in_bucket + '/')

# Reading blobs, parsing information and creating the csv file
filename = 'labels_csv_python.csv'
with open(filename, 'w') as f:
    for blob in blobs:
        if blob.name.endswith('.jpg'):
            # Object names always use '/', so build the URI by hand
            # instead of with os.path.join, which is OS-dependent
            bucket_path = 'gs://' + bucket_name + '/' + blob.name
            label = blob.name.split('/')[-2]
            # No space after the comma, matching the format in the question
            f.write(bucket_path + ',' + label + '\n')

# Uploading csv file to the bucket
bucket = storage_client.get_bucket(bucket_name)
destination_blob_name = path_in_bucket + '/' + filename
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(filename)
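If the folders mix file types or upper-case extensions, the filtering and row-building part of the script could be factored out as below. This is only a sketch under assumptions: `build_csv` and `IMAGE_EXTENSIONS` are hypothetical names, the extension list is a guess at what the dataset contains, and the function works on plain object names so it can be tried without touching a bucket.

```python
import csv
import io
import posixpath

# Assumed set of image extensions; adjust to what the bucket really holds.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def build_csv(bucket_name, blob_names, extensions=IMAGE_EXTENSIONS):
    """Return CSV text with one "gs://path,label" row per image object."""
    buffer = io.StringIO()
    # csv.writer quotes a field only if it contains a comma or a quote.
    writer = csv.writer(buffer, lineterminator="\n")
    for name in blob_names:
        # Case-insensitive extension check; skips folders and other files.
        if not name.lower().endswith(extensions):
            continue
        # The label is the folder that directly contains the object.
        label = posixpath.basename(posixpath.dirname(name))
        writer.writerow([f"gs://{bucket_name}/{name}", label])
    return buffer.getvalue()
```

In the script above, the names would come from the listing, for instance `build_csv(bucket_name, (blob.name for blob in blobs))`, and the returned text would be written to the local file before uploading.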

Answer 1 (score: 0)

For anyone who, like me, is looking for a way to create a .csv file for batch processing in Google AutoML but does not need the label column:

# Create an empty CSV file (truncating any earlier copy) and define the bucket path
bucket_path="gs://YOUR_BUCKET/FOLDER/"
filename="THE_FILENAME_YOU_WANT.csv"
: > "$filename"

old_ifs=$IFS
IFS=$'\n' # Split the gsutil output on newlines only

# List every file with your extension inside the bucket folder. Change the next line
# to match it, i.e. **.png becomes **.your_extension. ** searches recursively.
for i in $(gsutil ls "${bucket_path}**.png")
do
       echo "$i" >> "$filename"
done

IFS=$old_ifs # Restore the original value

gsutil cp "$filename" "$bucket_path"