I am trying to save a DataFrame to S3 in JSON format. It is saved as a file of JSON objects, but I want a JSON array file.
I have a CSV file on S3, which I am loading into a DataFrame in AWS Glue. After performing some transformations, I write the DataFrame to S3 as JSON. But it creates a file of separate JSON objects, like:
{obj1} {obj2} But I want to save it as a JSON array file, for example: [{obj1},{obj2}]
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type = "s3", connection_options = {"paths": [s3_path], "useS3ListImplementation": True, "recurse": True}, format = "csv", format_options = {"withHeader": True, "separator": "|"})
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("cdw_zip_id", "string", "cdw_zip_id", "string"), ("zip_code", "string", "zip_code", "string"), ("cdw_terr_id", "string", "cdw_terr_id", "string")], transformation_ctx = "applymapping1")
applymapping2 = applymapping1.toDF()
applymapping2.coalesce(1).write.format("org.apache.spark.sql.json").mode("overwrite").save(args['DEST_PATH'])
Actual: {obj1} {obj2} Expected: [{obj1},{obj2}]
Answer 0 (score: 0)
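When the df.write action is called, Spark evaluates lazily: all transformations are applied in a single pass, and the records in every partition are read at the same time on all the nodes (where those partitions reside) that are configured to run the workload.
Because every task writes its output independently, we can only expect individual records to be written to the destination, one JSON object per record, rather than one complete JSON document.
Calling coalesce only merges the partition data; it does not change how Spark's write operation behaves.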
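Since the writer emits one JSON object per record, one way to get the [{obj1},{obj2}] shape is to build the array yourself. The following is only a minimal sketch, not a built-in Spark or Glue option: json_lines_to_array is a hypothetical helper name, the sample records are made up for illustration, and the approach assumes the data is small enough to fit in the driver's memory.

```python
import json

def json_lines_to_array(lines):
    """Combine JSON-object strings (one per record) into one JSON array string."""
    # Parse each non-empty line, then serialize the whole list as a single array.
    records = [json.loads(line) for line in lines if line.strip()]
    return json.dumps(records)

# Made-up sample standing in for the per-record strings Spark produces.
sample = ['{"zip_code": "10001"}', '{"zip_code": "94105"}']
array_text = json_lines_to_array(sample)
print(array_text)

# Hypothetical use inside the Glue job (assumes a small DataFrame):
#   array_text = json_lines_to_array(applymapping2.toJSON().collect())
#   and then upload array_text to S3 yourself, for example with boto3's put_object.
```

For larger datasets, collecting everything to the driver is not practical; in that case it may be simpler to keep the one-object-per-line output and have the consumer read it line by line.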