I have a dataframe like the following.
scala> ds.show
+----+----------+----------+-----+
| key|attribute1|attribute2|value|
+----+----------+----------+-----+
|mac1| A1| B1| 10|
|mac2| A2| B1| 10|
|mac3| A2| B1| 10|
|mac1| A1| B2| 10|
|mac1| A1| B2| 10|
|mac3| A1| B1| 10|
|mac2| A2| B1| 10|
+----+----------+----------+-----+
For each value of attribute1, I want to find the top N keys together with the aggregated value for each key. Output: the aggregated value per key for attribute1 would be
+----+----------+-----+
| key|attribute1|value|
+----+----------+-----+
|mac1| A1| 30|
|mac2| A2| 20|
|mac3| A2| 10|
|mac3| A1| 10|
+----+----------+-----+
Now if N = 1, the output would be A1 - (mac1, 30), A2 - (mac2, 20).
How can I achieve this with a DataFrame / Dataset? I want to do this for every attribute; in the example above, for both attribute1 and attribute2.
Answer 0 (score: 1)
Given the input dataframe
+----+----------+----------+-----+
|key |attribute1|attribute2|value|
+----+----------+----------+-----+
|mac1|A1 |B1 |10 |
|mac2|A2 |B1 |10 |
|mac3|A2 |B1 |10 |
|mac1|A1 |B2 |10 |
|mac1|A1 |B2 |10 |
|mac3|A1 |B1 |10 |
|mac2|A2 |B1 |10 |
+----+----------+----------+-----+
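(If you want to reproduce this input locally, here is a minimal sketch, assuming a spark-shell session where spark.implicits._ is already in scope; the value column is declared as Double since the aggregated output below shows 30.0, 20.0, etc.)
// hypothetical construction of the sample input, named df to match the code below
val df = Seq(
  ("mac1", "A1", "B1", 10.0),
  ("mac2", "A2", "B1", 10.0),
  ("mac3", "A2", "B1", 10.0),
  ("mac1", "A1", "B2", 10.0),
  ("mac1", "A1", "B2", 10.0),
  ("mac3", "A1", "B1", 10.0),
  ("mac2", "A2", "B1", 10.0)
).toDF("key", "attribute1", "attribute2", "value")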
perform an aggregation on this input dataframe:
import org.apache.spark.sql.functions._
// sum the value column for every (key, attribute1) pair
val groupeddf = df.groupBy("key", "attribute1").agg(sum("value").as("value"))
which should give you
+----+----------+-----+
|key |attribute1|value|
+----+----------+-----+
|mac1|A1 |30.0 |
|mac3|A1 |10.0 |
|mac3|A2 |10.0 |
|mac2|A2 |20.0 |
+----+----------+-----+
Now you can use a Window function to generate a rank for each row within the grouped data, and filter for the rows with rank <= N:
import org.apache.spark.sql.expressions.Window

val N = 1
// rank the keys within each attribute1 partition by descending aggregated value
val windowSpec = Window.partitionBy("attribute1").orderBy($"value".desc)
groupeddf.withColumn("rank", rank().over(windowSpec))
  .filter($"rank" <= N)   // keep only the top N keys per attribute1 value
  .drop("rank")
which should give you the desired dataframe.
+----+----------+-----+
|key |attribute1|value|
+----+----------+-----+
|mac2|A2 |20.0 |
|mac1|A1 |30.0 |
+----+----------+-----+
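Note that rank() can return more than N rows per partition when there are ties in value; if you need exactly N rows per attribute value, row_number() over the same window is the usual alternative.
The question also asks how to do this for all attributes. One way, as a sketch under the same assumptions as above (same df, spark-shell implicits in scope), is simply to loop the group-rank-filter pattern over the attribute column names:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val N = 1
// apply the same aggregate-rank-filter pattern once per attribute column
Seq("attribute1", "attribute2").foreach { attr =>
  val w = Window.partitionBy(attr).orderBy($"value".desc)
  df.groupBy("key", attr).agg(sum("value").as("value"))
    .withColumn("rank", rank().over(w))
    .filter($"rank" <= N)
    .drop("rank")
    .show(false)
}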