Question

This is the result I get from a PySpark job in AWS Glue:

{a:1,b:7}
{a:1,b:9}
{a:1,b:3}

But I need to write this data to S3 and send it to an API as a JSON array:

[
 {a:1,b:2}, 
 {a:1,b:7}, 
 {a:1,b:9}, 
 {a:1,b:3}
]

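I tried converting the output to a DataFrame and then applying toJSON():

results = mapped_dyF.toDF()
jsonResults = results.toJSON().collect()

but now I can't write the result back to S3 with 'write_dynamic_frame.from_options', because it requires a DynamicFrame and my 'jsonResults' is no longer a DataFrame (it is a plain list of JSON strings).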

Answer 1

To get the data into JSON array format, I usually do the following, where df is the DataFrame containing the original data:

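if df.count() > 0:
    # Build the JSON array as a list of dicts, one per row.
    # collect() brings every row to the driver, so this suits small results.
    data = list()
    for row in df.collect():
        data.append({"a": row['a'],
                     "b": row['b']
                    })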
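The loop above hard-codes the columns a and b. A minimal sketch of the same idea for arbitrary columns follows; the helper name rows_to_records is hypothetical, and plain dicts stand in for Spark rows here, on the assumption that rows support key access the way a PySpark Row does.

```python
import json


def rows_to_records(rows, columns):
    """Turn row-like objects into plain dicts restricted to `columns`.

    Hypothetical helper: `rows` is any iterable whose items support
    row[column_name] lookups (plain dicts stand in for Spark rows here).
    """
    return [{col: row[col] for col in columns} for row in rows]


# Sample data mirroring the job output shown in the question.
sample_rows = [{"a": 1, "b": 7}, {"a": 1, "b": 9}, {"a": 1, "b": 3}]
records = rows_to_records(sample_rows, ["a", "b"])

# json.dumps on a list of dicts yields one JSON array.
print(json.dumps(records))
```

In the Glue job this would presumably be called as rows_to_records(df.collect(), df.columns), which keeps the list in sync with the schema if columns are added later.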

In this case I don't use Glue's write_dynamic_frame.from_options; instead I use boto3 to save the file:

import json
import uuid

import boto3

s3 = boto3.resource('s3')
# bucket_name must hold the name of the target S3 bucket.
# Dump the JSON array to the bucket under a unique key.
filename = '{0}_batch.json'.format(uuid.uuid4())
obj = s3.Object(bucket_name, filename)
obj.put(Body=json.dumps(data))
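If the list of strings from toJSON().collect() in the question is already at hand, it can also be turned into a single JSON array without rebuilding each row by hand. This is only a sketch of that alternative, not part of the approach above; the function name json_lines_to_array is hypothetical, and the sample strings stand in for what toJSON().collect() would return.

```python
import json


def json_lines_to_array(json_lines):
    """Combine a list of JSON object strings into one JSON array string.

    Each string is parsed first, so malformed input fails early
    instead of producing an invalid array.
    """
    records = [json.loads(line) for line in json_lines]
    return json.dumps(records)


# Stand-in for: jsonResults = results.toJSON().collect()
json_results = ['{"a":1,"b":7}', '{"a":1,"b":9}', '{"a":1,"b":3}']
body = json_lines_to_array(json_results)
print(body)
```

The resulting string could then be passed as Body to obj.put in place of json.dumps(data).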
