Find the top N elements for attribute combinations in a Spark DataFrame

Time: 2017-09-05 15:02:51

Tags: apache-spark apache-spark-sql

I have a DataFrame like the one below.

    scala> ds.show
    +----+----------+----------+-----+
    | key|attribute1|attribute2|value|
    +----+----------+----------+-----+
    |mac1|        A1|        B1|   10|
    |mac2|        A2|        B1|   10|
    |mac3|        A2|        B1|   10|
    |mac1|        A1|        B2|   10|
    |mac1|        A1|        B2|   10|
    |mac3|        A1|        B1|   10|
    |mac2|        A2|        B1|   10|
    +----+----------+----------+-----+

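For each value of attribute1, I want to find the top N keys together with the aggregated value of each key. Output: the aggregated value per key for attribute1 would be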

    +----+----------+-----+
    | key|attribute1|value|
    +----+----------+-----+
    |mac1|        A1|   30|
    |mac2|        A2|   20|
    |mac3|        A2|   10|
    |mac3|        A1|   10|
    +----+----------+-----+

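Now, if N = 1, the output would be A1 - (mac1, 30), A2 - (mac2, 20).

How can I achieve this with a DataFrame / Dataset? I want to do this for every attribute. In the example above, I want the result for both attribute1 and attribute2.

1 answer:

Answer 0 (score: 1)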

Given the input dataframe

    +----+----------+----------+-----+
    |key |attribute1|attribute2|value|
    +----+----------+----------+-----+
    |mac1|A1        |B1        |10   |
    |mac2|A2        |B1        |10   |
    |mac3|A2        |B1        |10   |
    |mac1|A1        |B2        |10   |
    |mac1|A1        |B2        |10   |
    |mac3|A1        |B1        |10   |
    |mac2|A2        |B1        |10   |
    +----+----------+----------+-----+

performing the aggregation on the above input dataframe

    import org.apache.spark.sql.functions._
    // Sum the values per (key, attribute1) pair
    val groupeddf = df.groupBy("key", "attribute1").agg(sum("value").as("value"))

should give you

    +----+----------+-----+
    |key |attribute1|value|
    +----+----------+-----+
    |mac1|A1        |30.0 |
    |mac3|A1        |10.0 |
    |mac3|A2        |10.0 |
    |mac2|A2        |20.0 |
    +----+----------+-----+

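Now you can use a Window function to rank each row of the grouped data within its attribute1 partition, and then filter to keep only the rows with rank <= N

    import org.apache.spark.sql.expressions.Window

    val N = 1

    // Rank keys within each attribute1 value, highest aggregated value first.
    // The $"..." syntax needs spark.implicits._, which spark-shell imports automatically.
    val windowSpec = Window.partitionBy("attribute1").orderBy($"value".desc)

    groupeddf.withColumn("rank", rank().over(windowSpec))
      .filter($"rank" <= N)
      .drop("rank")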

which should give you the desired dataframe

    +----+----------+-----+
    |key |attribute1|value|
    +----+----------+-----+
    |mac2|A2        |20.0 |
    |mac1|A1        |30.0 |
    +----+----------+-----+
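
The question also asks for the same result for attribute2. One possible way to cover that is to wrap the two steps above in a function that takes the attribute column name and apply it to each attribute in turn. The following is only a sketch: `topNPerAttribute` and `topByAttribute` are hypothetical names introduced here, the sample data is copied from the question with `value` assumed to be an integer column, and the session is assumed to run locally.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank, sum}

// Assumption: a local session; in spark-shell getOrCreate() reuses the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("topNPerAttribute")
  .getOrCreate()

// Sample rows from the question; value is assumed to be an Int here.
val df = spark.createDataFrame(Seq(
  ("mac1", "A1", "B1", 10),
  ("mac2", "A2", "B1", 10),
  ("mac3", "A2", "B1", 10),
  ("mac1", "A1", "B2", 10),
  ("mac1", "A1", "B2", 10),
  ("mac3", "A1", "B1", 10),
  ("mac2", "A2", "B1", 10)
)).toDF("key", "attribute1", "attribute2", "value")

// Hypothetical helper: aggregate per (key, attribute), then keep the rows
// whose rank within each attribute value is at most n.
def topNPerAttribute(df: DataFrame, attribute: String, n: Int): DataFrame = {
  val windowSpec = Window.partitionBy(attribute).orderBy(col("value").desc)
  df.groupBy("key", attribute)
    .agg(sum("value").as("value"))
    .withColumn("rank", rank().over(windowSpec))
    .filter(col("rank") <= n)
    .drop("rank")
}

// One result DataFrame per attribute column, keyed by the column name.
val topByAttribute: Map[String, DataFrame] =
  Seq("attribute1", "attribute2")
    .map(attribute => attribute -> topNPerAttribute(df, attribute, 1))
    .toMap

topByAttribute.foreach { case (_, result) => result.show(false) }
```

Note that `rank()` keeps ties: for attribute2 with N = 1, mac2 and mac3 both total 20 under B1, so both rows are returned. If exactly N rows per attribute value are required, `row_number()` can be used in place of `rank()`, at the cost of breaking ties arbitrarily.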