Creating a column with nested list aggregation in a DataFrame

Asked: 2018-02-27 17:28:24

Tags: scala apache-spark apache-spark-sql

I have a DataFrame with this structure:

val df = Seq(
  ("john", "tomato", 1),
  ("john", "carrot", 4),
  ("bill", "apple", 1),
  ("john", "tomato", 2),
  ("bill", "taco", 2)      
).toDF("name", "food", "price")

I need to aggregate it into a nested list, like this:

name | acc                       |
-----+---------------------------+
john |[(tomato, 3), (carrot, 4)] |
bill |[(apple, 1), (taco, 2)]   |

I tried it this way, but that's not right:

 df.groupBy($"name")
  .agg(collect_list(struct($"food", $"price")).as("foods"))
  .show(false)
+----+------------------------------------+
|name|foods                               |
+----+------------------------------------+
|john|[[tomato,1], [carrot,4], [tomato,2]]|
|bill|[[apple,1], [taco,2]]               |
+----+------------------------------------+

How can I get that?

1 answer:

Answer 0 (score: 0)

You need two groupBy aggregations, using the collect_list, struct, and sum built-in functions:

import org.apache.spark.sql.functions._
df.groupBy("name", "food").agg(sum("price").as("price"))
  .groupBy("name").agg(collect_list(struct("food", "price")).as("acc"))

The output DataFrame will be:

+----+------------------------+
|name|acc                     |
+----+------------------------+
|john|[[carrot,4], [tomato,3]]|
|bill|[[taco,2], [apple,1]]   |
+----+------------------------+
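The two-step logic above (first sum the price per (name, food) pair, then collect the (food, price) structs per name) can be mimicked with plain Scala collections. This is only an illustrative sketch of the aggregation logic, not Spark code, and the variable names are my own:

```scala
// Illustration of the two-step aggregation using plain Scala collections (no Spark).
val rows = Seq(
  ("john", "tomato", 1),
  ("john", "carrot", 4),
  ("bill", "apple", 1),
  ("john", "tomato", 2),
  ("bill", "taco", 2)
)

// Step 1: sum the price per (name, food) pair
// -- the analogue of groupBy("name", "food").agg(sum("price"))
val summed = rows
  .groupBy { case (name, food, _) => (name, food) }
  .map { case ((name, food), grp) => (name, food, grp.map(_._3).sum) }

// Step 2: collect the (food, price) pairs per name
// -- the analogue of groupBy("name").agg(collect_list(struct("food", "price")))
val acc: Map[String, List[(String, Int)]] = summed
  .groupBy(_._1)
  .map { case (name, grp) => name -> grp.map(t => (t._2, t._3)).toList }
```

Note that in both this sketch and in Spark the order of elements inside each list is not guaranteed; in Spark you can wrap the result with sort_array(collect_list(...)) if you need a deterministic order.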