Writing a DataFrame to a JSON array file with Spark

Time: 2019-11-06 15:43:43

Tags: scala apache-spark

We have the following code:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val sparkSession = SparkSession.builder
  .master("local")
  .appName("example")
  .getOrCreate()

// Sample rows matching the schema defined below.
val data = Seq(
  Row(1, "a", "b", "c", "d"),
  Row(5, "z", "b", "c", "d")
)

val schema = StructType(
  List(
    StructField("id", IntegerType, true),
    StructField("f2", StringType, true),
    StructField("f3", StringType, true),
    StructField("f4", StringType, true),
    StructField("f5", StringType, true)
  )
)

val df1 = sparkSession.createDataFrame(
  sparkSession.sparkContext.parallelize(data),
  schema
)

The goal is to write this DataFrame out as a single JSON array:

[{"id":1,"f2":"a","f3":"b","f4":"c","f5":"d"},
 {"id":5,"f2":"z","f3":"b","f4":"c","f5":"d"}]

So we need the two square brackets, but the DataFrame is over 50 GB, so the solution df1.toJSON.collect.mkString("[", ",", "]") is not viable. Is there a way to do this in Spark with good performance?
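
To make the quoted approach concrete, here is a minimal sketch: it builds the whole array as one string on the driver, which is exactly why it cannot work at 50+ GB. The file-writing step and the output path are assumptions, not part of the original question:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Collects every row to the driver and joins them into one bracketed string.
// This only works while the entire result fits in driver memory.
val jsonArray: String = df1.toJSON.collect().mkString("[", ",", "]")

// Writing the string out from the driver; "/tmp/array.json" is a placeholder path.
Files.write(Paths.get("/tmp/array.json"), jsonArray.getBytes(StandardCharsets.UTF_8))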

Thanks in advance.

0 Answers:

No answers yet.