I have a DataFrame with this structure:
val df = Seq(
("john", "tomato", 1),
("john", "carrot", 4),
("bill", "apple", 1),
("john", "tomato", 2),
("bill", "taco", 2)
).toDF("name", "food", "price")
I need to aggregate it into nested lists, like this:
name | acc                        |
-----+----------------------------+
john | [(tomato, 3), (carrot, 4)] |
bill | [(apple, 1), (taco, 2)]    |
I tried this approach, but it's not right:
df.groupBy($"name")
.agg(collect_list(struct($"food", $"price")).as("foods"))
.show(false)
+----+------------------------------------+
|name|foods                               |
+----+------------------------------------+
|john|[[tomato,1], [carrot,4], [tomato,2]]|
|bill|[[apple,1], [taco,2]] |
+----+------------------------------------+
How can I get that?
Answer:
You need two groupBy aggregations: first sum the prices per (name, food), then collect the pairs per name, using the collect_list, struct and sum built-in functions:
import org.apache.spark.sql.functions._
df.groupBy("name", "food").agg(sum("price").as("price"))
.groupBy("name").agg(collect_list(struct("food", "price")).as("acc"))
This gives you the output dataframe:
+----+------------------------+
|name|acc |
+----+------------------------+
|john|[[carrot,4], [tomato,3]]|
|bill|[[taco,2], [apple,1]] |
+----+------------------------+
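The same two-step logic can be sketched with plain Scala collections (no Spark session needed), which may help see what the two groupBy calls do; the object and method names here are hypothetical, not part of any Spark API:

```scala
object AggSketch {
  val rows: Seq[(String, String, Int)] = Seq(
    ("john", "tomato", 1),
    ("john", "carrot", 4),
    ("bill", "apple", 1),
    ("john", "tomato", 2),
    ("bill", "taco", 2)
  )

  def aggregate(rows: Seq[(String, String, Int)]): Map[String, List[(String, Int)]] = {
    // Step 1: group by (name, food) and sum the prices,
    // mirroring groupBy("name", "food").agg(sum("price")).
    val summed: Seq[(String, String, Int)] = rows
      .groupBy { case (name, food, _) => (name, food) }
      .toSeq
      .map { case ((name, food), grp) => (name, food, grp.map(_._3).sum) }

    // Step 2: group by name and collect the (food, price) pairs,
    // mirroring groupBy("name").agg(collect_list(struct("food", "price"))).
    summed
      .groupBy(_._1)
      .map { case (name, grp) =>
        name -> grp.map { case (_, food, price) => (food, price) }.toList
      }
  }
}
```

Note that, as in Spark, the order of the collected pairs is not guaranteed, so compare the results as sets.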