Creating a column with nested list aggregation in a DataFrame

Asked: 2018-02-27 17:28:24

Tags: scala apache-spark apache-spark-sql

I have a DataFrame with this structure:

val df = Seq(
  ("john", "tomato", 1),
  ("john", "carrot", 4),
  ("bill", "apple", 1),
  ("john", "tomato", 2),
  ("bill", "taco", 2)      
).toDF("name", "food", "price")

I need to aggregate it into a nested list, like this:

name | acc                       |
-----+---------------------------+
john |[(tomato, 3), (carrot, 4)] |
bill |[(apple, 1), (taco, 2)]   |

I tried it this way, but that's not right:

 df.groupBy($"name")
  .agg(collect_list(struct($"food", $"price")).as("foods"))
  .show(false)
+----+------------------------------------+
|name|foods                               |
+----+------------------------------------+
|john|[[tomato,1], [carrot,4], [tomato,2]]|
|bill|[[apple,1], [taco,2]]               |
+----+------------------------------------+

How can I get that?

1 answer:

Answer 0 (score: 0)

You need two groupBy aggregations, using the collect_list, struct, and sum built-in functions:

import org.apache.spark.sql.functions._
df.groupBy("name", "food").agg(sum("price").as("price"))
  .groupBy("name").agg(collect_list(struct("food", "price")).as("acc"))

The output DataFrame will be:

+----+------------------------+
|name|acc                     |
+----+------------------------+
|john|[[carrot,4], [tomato,3]]|
|bill|[[taco,2], [apple,1]]   |
+----+------------------------+
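The two-step logic above (first sum the price per (name, food) pair, then collect the (food, price) structs per name) can be mimicked with plain Scala collections. This is only an illustrative sketch of the aggregation logic, not Spark code, and the variable names are my own:

```scala
// Illustration of the two-step aggregation using plain Scala collections (no Spark).
val rows = Seq(
  ("john", "tomato", 1),
  ("john", "carrot", 4),
  ("bill", "apple", 1),
  ("john", "tomato", 2),
  ("bill", "taco", 2)
)

// Step 1: sum the price per (name, food) pair
// -- the analogue of groupBy("name", "food").agg(sum("price"))
val summed = rows
  .groupBy { case (name, food, _) => (name, food) }
  .map { case ((name, food), grp) => (name, food, grp.map(_._3).sum) }

// Step 2: collect the (food, price) pairs per name
// -- the analogue of groupBy("name").agg(collect_list(struct("food", "price")))
val acc: Map[String, List[(String, Int)]] = summed
  .groupBy(_._1)
  .map { case (name, grp) => name -> grp.map(t => (t._2, t._3)).toList }
```

Note that in both this sketch and in Spark the order of elements inside each list is not guaranteed; in Spark you can wrap the result with sort_array(collect_list(...)) if you need a deterministic order.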