Here is the object structure in the S3 bucket:

s3://bucket/
    open-images/
        apple/
            images/
                file112.jpg
                ...
            pascal/
                file112.xml
                ...
I want to convert the XML files to JSON and place them under json/, so that the object structure under the S3 bucket looks like this:
s3://bucket/
    open-images/
        apple/
            images/
                file112.jpg
                ...
            pascal/
                file112.xml
                ...
            json/
                file112.json
                ...
import os
import json
import shutil

import boto3
import xmltodict

s3 = boto3.resource('s3')
bucket = s3.Bucket("<bucket_name>")

for obj in bucket.objects.filter(Prefix="open-images/", Delimiter='jpg'):
    if "xml" in obj.key:
        # generate the destination path for storing json files on the SageMaker instance
        xml_file_name = obj.key
        start, end = xml_file_name.split("pascal")
        dest_path = start + "json" + end
        # convert xml to json
        xml_file = obj.get()['Body']
        data_dict = xmltodict.parse(xml_file.read())
        xml_file.close()
        json_data = json.dumps(data_dict)
        # store the json file under the destination path on the SageMaker instance
        os.makedirs(start + "json", exist_ok=True)
        with open("{}.json".format(dest_path[:-4]), "w") as json_file:
            json_file.write(json_data)
        # copy the json file to s3
        os.system('aws s3 cp --recursive "./open-images/" "s3://<bucket_name>/open-images/"')
        # delete the json file from the SageMaker instance to avoid running out of space
        shutil.rmtree("open-images/{}/".format(start[12:]))
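The key-remapping step above (swapping the "pascal" folder for "json" and the ".xml" extension for ".json") can be checked in isolation, using a hypothetical key that matches the bucket layout:

```python
# Hypothetical example key matching the bucket layout above
xml_key = "open-images/apple/pascal/file112.xml"

# Split around the "pascal" folder and rebuild the key under "json"
start, end = xml_key.split("pascal")
dest_key = start + "json" + end[:-4] + ".json"  # drop ".xml", append ".json"
print(dest_key)  # → open-images/apple/json/file112.json
```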
Is there a better way to do this?
Answer 0 (score: 1)
A better approach, suggested by @Tomalak, is to write the JSON files directly to S3 objects instead of writing them locally and then copying them to S3. The final, better and faster code looks like this:
import json

import boto3
import xmltodict

# initiate s3 resource
s3 = boto3.resource('s3')

# select bucket
bucket_name = "<bucket_name>"
bucket = s3.Bucket(bucket_name)

for obj in bucket.objects.filter(Prefix="<key>", Delimiter='jpg'):
    if "xml" in obj.key:
        # generate the final destination key
        xml_file_name = obj.key
        start, end = xml_file_name.split("pascal")
        dest_path = start + "json" + end
        # convert xml to json
        xml_file = obj.get()['Body']
        data_dict = xmltodict.parse(xml_file.read())
        xml_file.close()
        json_data = json.dumps(data_dict)
        # write the json file directly to s3
        json_obj = s3.Object(bucket_name, dest_path[:-4] + '.json')
        json_obj.put(Body=json_data)
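As an aside, xmltodict is a third-party package. For simple annotation files like the ones above, the conversion step can also be sketched with the standard library's ElementTree. This is a simplified converter that ignores attributes and repeated tags, so it is an illustration of the XML-to-dict idea, not a drop-in replacement for xmltodict:

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Recursively convert an Element to a plain dict; leaf elements
    become their text. Simplified: no attributes, no repeated tags."""
    children = list(elem)
    if not children:
        return elem.text
    return {child.tag: element_to_dict(child) for child in children}

# Hypothetical annotation snippet in the spirit of the files above
xml = "<annotation><folder>apple</folder><filename>file112.jpg</filename></annotation>"
root = ET.fromstring(xml)
data = {root.tag: element_to_dict(root)}
print(json.dumps(data))
# → {"annotation": {"folder": "apple", "filename": "file112.jpg"}}
```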