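I have the "structured data" shown below and need to convert it into the "expected result" shown below. My "output schema" is shown as well. I would appreciate any help on how to achieve this with Spark Scala code.
Note: the structured data is grouped on the columns SN and VIN. Rows with the same SN and VIN should form a single row; if SN or VIN changes, the data should appear on the next row.
Structured data: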
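+-----------------+-------------+--------------------+---+
|VIN              |ST           |SV                  |SN |
+-----------------+-------------+--------------------+---+
|FU74HZ501740XXXXX|1566799999225|44.0                |APP|
|FU74HZ501740XXXXX|1566800002758|61.0                |APP|
|FU74HZ501740XXXXX|1566800009446|23.39               |ASP|
+-----------------+-------------+--------------------+---+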
Expected result:
Output schema:
import org.apache.spark.sql.types._

val outputSchema = StructType(
  List(
    StructField("VIN", StringType, true),
    StructField("EVENTS", ArrayType(
      StructType(Array(
        StructField("SN", StringType, true),
        // LongType rather than IntegerType: ST values such as 1566799999225 exceed Int.MaxValue
        StructField("ST", LongType, true),
        StructField("SV", DoubleType, true)
      ))))
  )
)
Answer 0 (score: 3)
In Spark 2.1, you can use struct and collect_list to achieve this.
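// toDF on a Seq needs the implicits of the active SparkSession (here named spark)
import spark.implicits._

val df_2 = Seq(
  ("FU74HZ501740XXXX", 1566799999225.0, 44.0, "APP"),
  ("FU74HZ501740XXXX", 1566800002758.0, 61.0, "APP"),
  ("FU74HZ501740XXXX", 1566800009446.0, 23.39, "ASP")
).toDF("vin", "st", "sv", "sn")
df_2.show(false)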
+----------------+-----------------+-----+---+
|vin |st |sv |sn |
+----------------+-----------------+-----+---+
|FU74HZ501740XXXX|1.566799999225E12|44.0 |APP|
|FU74HZ501740XXXX|1.566800002758E12|61.0 |APP|
|FU74HZ501740XXXX|1.566800009446E12|23.39|ASP|
+----------------+-----------------+-----+---+
Use collect_list together with struct:
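import org.apache.spark.sql.functions._

df_2.groupBy("vin", "sn")
  // gather the rows of each (vin, sn) group into an array of structs
  .agg(collect_list(struct($"st", $"sv", $"sn")).as("events"))
  // serialize the array as a JSON string
  .withColumn("events", to_json($"events"))
  .drop(col("sn"))
  .show(false)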
This gives the following output:
+----------------+---------------------------------------------------------------------------------------------+
|vin |events |
+----------------+---------------------------------------------------------------------------------------------+
|FU74HZ501740XXXX|[{"st":1.566800009446E12,"sv":23.39,"sn":"ASP"}] |
|FU74HZ501740XXXX|[{"st":1.566799999225E12,"sv":44.0,"sn":"APP"},{"st":1.566800002758E12,"sv":61.0,"sn":"APP"}]|
+----------------+---------------------------------------------------------------------------------------------+
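Note that groupBy("vin", "sn") merges every row that shares the same vin and sn, even when a different sn occurs in between. If the requirement is read as "start a new row whenever SN changes in time order", one possible approach is to tag consecutive runs with lag and a running sum before collecting. The following is only a minimal sketch under stated assumptions: ST defines the event order within each VIN, ST is kept as a Long because the epoch-millisecond values do not fit in an Int, and the names ConsecutiveRuns, groupConsecutive, prev_sn, is_new_run and run_id are hypothetical helpers introduced for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object ConsecutiveRuns {
  // Hypothetical helper: one output row per consecutive run of the same SN within a VIN.
  def groupConsecutive(df: DataFrame): DataFrame = {
    // Assumption: ST defines the event order inside each VIN.
    val ordered = Window.partitionBy(col("VIN")).orderBy(col("ST"))
    // Running frame used to turn the "new run" flags into a run counter.
    val running = ordered.rowsBetween(Window.unboundedPreceding, Window.currentRow)

    df
      // SN of the previous event for the same VIN (null for the first event)
      .withColumn("prev_sn", lag(col("SN"), 1).over(ordered))
      // 1 when a run starts (first event, or SN differs from the previous one), else 0
      .withColumn("is_new_run",
        when(col("prev_sn").isNull || (col("prev_sn") =!= col("SN")), 1).otherwise(0))
      // cumulative count of run starts, used as an identifier for each run
      .withColumn("run_id", sum(col("is_new_run")).over(running))
      .groupBy(col("VIN"), col("run_id"))
      // collect_list does not guarantee order, so sort the structs (ST is the first field)
      .agg(sort_array(collect_list(struct(col("ST"), col("SV"), col("SN")))).as("EVENTS"))
      .orderBy(col("VIN"), col("run_id"))
      .drop("run_id")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .appName("consecutive-runs")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample rows taken from the question
    val df = Seq(
      ("FU74HZ501740XXXXX", 1566799999225L, 44.0, "APP"),
      ("FU74HZ501740XXXXX", 1566800002758L, 61.0, "APP"),
      ("FU74HZ501740XXXXX", 1566800009446L, 23.39, "ASP")
    ).toDF("VIN", "ST", "SV", "SN")

    // Two runs for this sample: APP (two events), then ASP (one event)
    groupConsecutive(df).show(false)

    spark.stop()
  }
}
```

For the sample data this yields the same two groups as the plain groupBy, because each SN value occurs in a single run. The results differ only when an SN value reappears after a different one. The struct fields are ordered ST, SV, SN so that sort_array orders events by ST; reorder them with a select if the exact field order of the output schema matters.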
Answer 1 (score: 1)
You can get it through the SparkSession.
val df = spark.read.json("/path/to/json/file/test.json")
Here spark is the SparkSession object.
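For completeness, here is a minimal self-contained sketch of that read, assuming a local session and a throwaway temp file in place of the path above. The object name ReadJsonSketch, the helper readJson and the sample JSON line (the first row of the question arranged in the shape of the output schema) are illustrative assumptions, not part of the original answer.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadJsonSketch {
  // Hypothetical helper wrapping the call from the answer above
  def readJson(spark: SparkSession, path: String): DataFrame =
    spark.read.json(path)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .appName("read-json-sketch")
      .master("local[*]")
      .getOrCreate()

    // spark.read.json expects one JSON object per line by default
    val tmp = Files.createTempFile("events", ".json")
    val line =
      """{"VIN":"FU74HZ501740XXXXX","EVENTS":[{"SN":"APP","ST":1566799999225,"SV":44.0}]}"""
    Files.write(tmp, line.getBytes(StandardCharsets.UTF_8))

    val df = readJson(spark, tmp.toString)
    df.printSchema() // schema is inferred from the data
    df.show(false)

    spark.stop()
  }
}
```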