I have a DataFrame with two columns that act as the "key": id1 and id2:
val df1 = Seq(
  (1, 11, "n1", "d1"),
  (1, 22, "n2", "d2"),
  (2, 11, "n3", "d3"),
  (2, 11, "n4", "d4")
).toDF("id1", "id2", "number", "data")
scala> df1.show
+---+---+------+----+
|id1|id2|number|data|
+---+---+------+----+
|  1| 11|    n1|  d1|
|  1| 22|    n2|  d2|
|  2| 11|    n3|  d3|
|  2| 11|    n4|  d4|
+---+---+------+----+
I want the rows grouped by the DataFrame's key columns, with each group's remaining columns collected as JSON, like this:
+---+---+------------------------------------------------------------------+
|id1|id2|json                                                              |
+---+---+------------------------------------------------------------------+
|  1| 11|[{"number" : "n1", "data": "d1"}]                                 |
|  1| 22|[{"number" : "n2", "data": "d2"}]                                 |
|  2| 11|[{"number" : "n3", "data": "d3"}, {"number" : "n4", "data": "d4"}]|
+---+---+------------------------------------------------------------------+
Versions:
Spark: 2.2
Scala: 2.11
Answer 0 (score: 5)
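This can be done by first using to_json to convert the number and data columns into a JSON string. Then use groupBy on the two id columns together with collect_list to get the desired result.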
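import org.apache.spark.sql.functions.{collect_list, struct, to_json}
// Serialize each row's payload to JSON, then collect the strings per (id1, id2) key.
val df2 = df1.withColumn("json", to_json(struct($"number", $"data")))
  .groupBy("id1", "id2").agg(collect_list($"json").as("json"))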
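Note that collect_list over JSON strings yields an array of strings per group rather than one JSON string. If a single JSON array string per key is wanted, one possible variation is to join the collected strings with concat_ws and wrap them in brackets. The following is only a sketch: the object name, the local SparkSession settings, and the name df3 are assumptions made for illustration, and the element order inside each group is not guaranteed by collect_list.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, concat, concat_ws, lit, struct, to_json}

// Hypothetical standalone wrapper; in spark-shell only the df3 expression is needed.
object GroupedJsonSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (app name and master are assumptions).
    val spark = SparkSession.builder()
      .appName("grouped-json-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the question.
    val df1 = Seq(
      (1, 11, "n1", "d1"),
      (1, 22, "n2", "d2"),
      (2, 11, "n3", "d3"),
      (2, 11, "n4", "d4")
    ).toDF("id1", "id2", "number", "data")

    val df3 = df1
      // One JSON object string per row, built from the non-key columns.
      .withColumn("json", to_json(struct($"number", $"data")))
      .groupBy("id1", "id2")
      // Join the per-row JSON strings with commas and wrap them in [ ] so that
      // each group ends up with a single JSON array string.
      .agg(concat(lit("["), concat_ws(",", collect_list($"json")), lit("]")).as("json"))

    // truncate = false so the full JSON string is printed.
    df3.show(false)

    spark.stop()
  }
}
```

The JSON produced by to_json is compact, so its whitespace will differ from the hand-written target table in the question.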