If I have a dataset like this:
name | food | drink | dollars
==================================
John | salad | water | 1
Dave | salad | soda | 2
John | burger | water | 5
John | burger | soda | 1
how can I get the following result in Spark (Scala):
name | food_count | drink_count | total_dollars
==========================================================================
John | [(salad, 1), (burger, 2)] | [(water, 2), (soda, 1)] | 7
Dave | [(salad, 1)] | [(soda, 1)] | 2
I'm not sure which aggregation function to apply after groupBy("name").
Do I need to write a UDAF?
This feels like a common enough problem that I'm hoping there is a solution using built-in functions.
Answer 0 (score: 2)
In Spark, you can try the multi-dimensional aggregation cube to get the counts, then use collect_list:
import org.apache.spark.sql.functions._

val df = Seq(
  ("John", "salad",  "water", 1),
  ("Dave", "salad",  "soda",  2),
  ("John", "burger", "water", 5),
  ("John", "burger", "soda",  1)
).toDF("name", "food", "drink", "dollar")

// cube produces counts for every combination of the grouping columns,
// including rows where some of the columns are null
val testing = df.cube("name", "food", "drink").count()

// per-name drink counts, joined with the per-name dollar totals
val drinks_df = testing
  .filter(col("food").isNotNull)
  .groupBy("name", "drink")
  .agg(struct(col("drink"), sum("count")).as("drink_count"))
  .na.drop
  .groupBy("name")
  .agg(collect_list("drink_count").as("drink_count"))
  .join(df.groupBy("name").agg(sum("dollar").as("dollars_sum")), Seq("name"), "left")
  .withColumnRenamed("name", "name1")

// per-name food counts
val fooddf1 = testing
  .filter(col("drink").isNotNull)
  .groupBy("name", "food")
  .agg(struct(col("food"), sum("count")).as("food_count"))
  .na.drop
  .groupBy("name")
  .agg(collect_list("food_count").as("food_count"))

fooddf1.join(drinks_df, col("name1") === col("name"), "left").drop("name1").show(false)
+----+-------------------------+-----------------------+-----------+
|name|food_count |drink_count |dollars_sum|
+----+-------------------------+-----------------------+-----------+
|Dave|[[salad, 1]] |[[soda, 1]] |2 |
|John|[[salad, 1], [burger, 2]]|[[water, 2], [soda, 1]]|7 |
+----+-------------------------+-----------------------+-----------+
With the multi-dimensional aggregation, you can compute counts for rows over different groupings.
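For reference, the same per-name aggregation can be sketched with plain Scala collections, which is handy for checking the expected result outside Spark. The `Rollup` object, `Row` case class, and field names below are illustrative, not part of the Spark API:

```scala
// A minimal sketch of the aggregation over plain Scala collections,
// mirroring the rows from the question. Illustrative only.
object Rollup {
  case class Row(name: String, food: String, drink: String, dollars: Int)

  val rows = List(
    Row("John", "salad", "water", 1),
    Row("Dave", "salad", "soda", 2),
    Row("John", "burger", "water", 5),
    Row("John", "burger", "soda", 1)
  )

  // For each name: counts per food, counts per drink, and total dollars.
  val result: Map[String, (Map[String, Int], Map[String, Int], Int)] =
    rows.groupBy(_.name).map { case (name, rs) =>
      val foodCount  = rs.groupBy(_.food).map { case (f, g) => f -> g.size }
      val drinkCount = rs.groupBy(_.drink).map { case (d, g) => d -> g.size }
      name -> (foodCount, drinkCount, rs.map(_.dollars).sum)
    }
}
```

This makes the shape of the answer explicit: one map of food counts and one map of drink counts per name, plus the dollar sum, which corresponds to the collect_list-of-struct columns in the DataFrame version.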