Converting structured data to JSON format using Spark Scala

Time: 2019-09-20 04:05:27

Tags: scala apache-spark apache-spark-sql

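I have the "structured data" shown below and need to convert it into the "expected result" shown below. My "output schema" is also included. I would appreciate any help on how to achieve this with Spark Scala code.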

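Note: the structured data should be grouped by the columns SN and VIN. Rows with the same SN and VIN should end up in a single row; when SN or VIN changes, the data should appear in the next row.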

Structured data:

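+-----------------+-------------+--------------------+---+
|VIN              |ST           |SV                  |SN |
+-----------------+-------------+--------------------+---+
|FU74HZ501740XXXXX|1566799999225|44.0                |APP|
|FU74HZ501740XXXXX|1566800002758|61.0                |APP|
|FU74HZ501740XXXXX|1566800009446|23.39               |ASP|
+-----------------+-------------+--------------------+---+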

Expected result:

(image not available)

Output schema:

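import org.apache.spark.sql.types._

val outputSchema = StructType(
  List(
    StructField("VIN", StringType, true),
    StructField("EVENTS", ArrayType(
        StructType(Array(
          StructField("SN", StringType, true),
          // ST values such as 1566799999225 exceed the Int range, so LongType replaces IntegerType
          StructField("ST", LongType, true),
          StructField("SV", DoubleType, true)
        ))))
  )
)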
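The ST values in the sample data look like epoch-millisecond timestamps (an assumption; the question does not say so). They are larger than a 32-bit integer can hold, so a field typed as IntegerType could not store them and a 64-bit type such as LongType is the safer fit. A minimal plain-Scala check, using one value copied from the sample table:

```scala
// One ST value copied from the sample data in the question.
val sampleSt: Long = 1566799999225L

// True only if the value can be stored in a 32-bit Int without overflow.
val fitsInInt: Boolean = sampleSt >= Int.MinValue && sampleSt <= Int.MaxValue

println(s"Int.MaxValue = ${Int.MaxValue}")
println(s"sample ST    = $sampleSt")
println(s"fits in Int  = $fitsInInt") // prints "fits in Int  = false"

// Interpreting the value as epoch milliseconds is an assumption.
println(java.time.Instant.ofEpochMilli(sampleSt))
```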

2 answers:

Answer 0 (score: 3)

In Spark 2.1, you can achieve this using struct and collect_list.

import spark.implicits._  // needed for toDF and the $"col" syntax

val df_2 = Seq(
  ("FU74HZ501740XXXX",1566799999225.0,44.0,"APP"),
  ("FU74HZ501740XXXX",1566800002758.0,61.0,"APP"),
  ("FU74HZ501740XXXX",1566800009446.0,23.39,"ASP")
).toDF("vin","st","sv","sn")

df_2.show(false)
+----------------+-----------------+-----+---+
|vin             |st               |sv   |sn |
+----------------+-----------------+-----+---+
|FU74HZ501740XXXX|1.566799999225E12|44.0 |APP|
|FU74HZ501740XXXX|1.566800002758E12|61.0 |APP|
|FU74HZ501740XXXX|1.566800009446E12|23.39|ASP|
+----------------+-----------------+-----+---+

Use collect_list together with struct:

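import org.apache.spark.sql.functions._

df_2.groupBy("vin","sn")                                       // one group per (vin, sn) pair
  .agg(collect_list(struct($"st", $"sv",$"sn")).as("events"))  // gather each group's rows into an array of structs
  .withColumn("events",to_json($"events"))                     // serialize the array to a JSON string
  .drop(col("sn"))                                             // sn is already present inside each event
  .show(false)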

This gives the required output:

+----------------+---------------------------------------------------------------------------------------------+
|vin             |events                                                                                       |
+----------------+---------------------------------------------------------------------------------------------+
|FU74HZ501740XXXX|[{"st":1.566800009446E12,"sv":23.39,"sn":"ASP"}]                                             |
|FU74HZ501740XXXX|[{"st":1.566799999225E12,"sv":44.0,"sn":"APP"},{"st":1.566800002758E12,"sv":61.0,"sn":"APP"}]|
+----------------+---------------------------------------------------------------------------------------------+
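The output above has st in scientific notation because the sample rows were created with Double timestamps, and events is a JSON string rather than a nested array. The following is a minimal sketch, not part of the original answer, that keeps ST as a Long and leaves EVENTS as an array of structs so the result has the same shape as the asker's output schema. The local SparkSession, the app name, and the names input, grouped and jsonLines are assumptions made for illustration; the block is written as top-level statements for spark-shell or a script.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("events-to-json") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Sample rows copied from the question; ST is kept as Long, not Double.
val input = Seq(
  ("FU74HZ501740XXXXX", 1566799999225L, 44.0, "APP"),
  ("FU74HZ501740XXXXX", 1566800002758L, 61.0, "APP"),
  ("FU74HZ501740XXXXX", 1566800009446L, 23.39, "ASP")
).toDF("VIN", "ST", "SV", "SN")

// One row per (VIN, SN) pair; each row carries its events as an array of structs.
val grouped = input
  .groupBy($"VIN", $"SN")
  .agg(collect_list(struct($"SN", $"ST", $"SV")).as("EVENTS"))
  .drop("SN")

grouped.printSchema()

// toJSON turns every row into one JSON string, nested array included.
val jsonLines: Array[String] = grouped.toJSON.collect()
jsonLines.foreach(println)
```

collect_list does not guarantee the order of the collected events. To persist the result instead of collecting it, grouped.write.json can be pointed at an output directory.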

Answer 1 (score: 1)

You can get it through SparkSession.


val df = spark.read.json("/path/to/json/file/test.json")

Here, spark is the SparkSession object.
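Reading works on data that is already JSON. As a minimal sketch under stated assumptions, the snippet below writes one sample JSON line to a temporary file and reads it back with an explicit schema, so the field types are fixed instead of inferred. The temporary file, the sample line, and the names eventSchema, jsonFile and events are hypothetical and only serve the illustration; ST is declared as LongType here because the sample values do not fit in an Int.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("read-events-json") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Schema shaped like the one in the question, with ST as LongType (assumption).
val eventSchema = StructType(Seq(
  StructField("VIN", StringType, nullable = true),
  StructField("EVENTS", ArrayType(StructType(Seq(
    StructField("SN", StringType, nullable = true),
    StructField("ST", LongType, nullable = true),
    StructField("SV", DoubleType, nullable = true)
  ))), nullable = true)
))

// Hypothetical sample file holding one JSON object per line.
val jsonFile = Files.createTempDirectory("events-json").resolve("test.json")
val sampleLine =
  """{"VIN":"FU74HZ501740XXXXX","EVENTS":[{"SN":"ASP","ST":1566800009446,"SV":23.39}]}"""
Files.write(jsonFile, sampleLine.getBytes(StandardCharsets.UTF_8))

// Read the file with the explicit schema instead of relying on inference.
val events = spark.read.schema(eventSchema).json(jsonFile.toString)
events.printSchema()
events.show(false)
```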