Convert XML files in an S3 bucket to JSON and put them in the same bucket

Time: 2021-05-10 10:04:02

Tags: python amazon-s3 boto3

The objects in the S3 bucket are laid out as follows:

s3://bucket/
    open-images/
        apple/
            images/
                file112.jpg
                ...
            pascal/
                file112.xml
                ...

Goal

Convert each XML file to JSON and store the result under json/, so that the objects in the bucket end up laid out like this:

s3://bucket/
    open-images/
        apple/
            images/
                file112.jpg
                ...
            pascal/
                file112.xml
                ...
            json/
                file112.json
                ...
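
The key mapping implied by the two layouts above can be sketched as a small pure function. This is only an illustration: the name `json_key_for` and its `src_dir`/`dst_dir` parameters are hypothetical and do not appear in the question. It assumes every XML key sits directly inside a `pascal/` folder.

```python
from pathlib import PurePosixPath


def json_key_for(xml_key: str, src_dir: str = "pascal", dst_dir: str = "json") -> str:
    """Map '<...>/pascal/<name>.xml' to '<...>/json/<name>.json'.

    Hypothetical helper; S3 keys use '/' separators, so PurePosixPath is
    used regardless of the local operating system.
    """
    path = PurePosixPath(xml_key)
    if path.suffix != ".xml" or path.parent.name != src_dir:
        raise ValueError(f"not an XML key under {src_dir}/: {xml_key}")
    # Swap the containing folder, then swap the file extension
    return str(path.parent.with_name(dst_dir) / path.with_suffix(".json").name)


print(json_key_for("open-images/apple/pascal/file112.xml"))
# open-images/apple/json/file112.json
```

Checking the parent folder name, rather than splitting the key on the string "pascal", avoids surprises when "pascal" also appears elsewhere in a key.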

My approach

for obj in bucket.objects.filter(Prefix="open-images/", Delimiter='jpg'):
    if "xml" in obj.key:

        # generating the destination path for storing JSON files on the SageMaker instance
        xml_file_name = obj.key
        start,end = xml_file_name.split("pascal")
        dest_path = start+"json"+end
        
        # converting xml to json
        xml_file = obj.get()['Body']
        data_dict = xmltodict.parse(xml_file.read())
        xml_file.close()
        json_data = json.dumps(data_dict)
        
        # writing json file to s3
        # storing the JSON file under the destination path on the SageMaker instance
        os.makedirs(start+"json")
        with open("{}.json".format(dest_path[:-4]), "w") as json_file:
            json_file.write(json_data)
            json_file.close()
        # copying the json file to s3
        os.system('aws s3 cp --recursive "./open-images/" "s3://<bucket_name>/open-images/"')
        # deleting the JSON file from the SageMaker instance to avoid a memory error
        shutil.rmtree("open-images/{}/".format(start[12:]))

Question

Is there a better way to do this?

1 Answer:

Answer 0: (score: 1)

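As @Tomalak suggested, a better approach is to write the JSON directly to an S3 object instead of writing it to local disk and then copying it to S3. The resulting code, which is both simpler and faster, looks like this: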

import json

import boto3
import xmltodict

# Create the S3 resource and select the bucket
s3 = boto3.resource('s3')
bucket_name = "<bucket_name>"
bucket = s3.Bucket(bucket_name)

# Delimiter='jpg' groups every key that contains "jpg" after the prefix into
# a common prefix, so the image objects are not yielded by this loop
for obj in bucket.objects.filter(Prefix="<key>", Delimiter='jpg'):

    if obj.key.endswith(".xml"):
        # Build the destination key: swap the "pascal" folder for "json"
        # and the ".xml" extension for ".json"
        start, end = obj.key.split("pascal", 1)
        dest_key = start + "json" + end[:-4] + ".json"

        # Convert the XML to JSON
        xml_file = obj.get()['Body']
        data_dict = xmltodict.parse(xml_file.read())
        xml_file.close()
        json_data = json.dumps(data_dict)

        # Write the JSON straight to S3, with no local file involved
        s3.Object(bucket_name, dest_key).put(Body=json_data)
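
The same direct-to-S3 idea can also be wrapped in a function so that it is easy to try out without touching a real bucket. The following is a minimal sketch, not part of the original answer: the name `convert_xml_objects` and its `parse_xml` parameter are hypothetical. It filters on the `.xml` suffix instead of relying on the `Delimiter='jpg'` trick, and it assumes `bucket` behaves like a boto3 `Bucket` resource (`objects.filter` and `put_object`).

```python
import json


def convert_xml_objects(bucket, prefix, parse_xml):
    """Write a JSON copy of every .xml object under `prefix`; return the new keys.

    bucket    -- a boto3 Bucket resource, or any object offering the same
                 objects.filter(Prefix=...) and put_object(...) methods
    parse_xml -- callable turning XML bytes into a dict, e.g. xmltodict.parse
    """
    written = []
    for obj in bucket.objects.filter(Prefix=prefix):
        if not obj.key.endswith(".xml") or "pascal" not in obj.key:
            continue  # skip images and anything outside a "pascal" folder
        # Same key rewrite as above: "pascal" folder -> "json" folder
        start, end = obj.key.split("pascal", 1)
        dest_key = start + "json" + end[:-len(".xml")] + ".json"

        data = parse_xml(obj.get()["Body"].read())
        bucket.put_object(
            Key=dest_key,
            Body=json.dumps(data).encode("utf-8"),
            ContentType="application/json",
        )
        written.append(dest_key)
    return written
```

With boto3 and xmltodict installed, a call might look like `convert_xml_objects(boto3.resource("s3").Bucket("<bucket_name>"), "open-images/", xmltodict.parse)`. Passing the parser in as an argument keeps the function independent of any particular XML library.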