Question

I have a DataFrame A_df, for example:

+------+----+-----+
|   uid|year|month|
+------+----+-----+
|     1|2017|   03|
|     1|2017|   05|
|     2|2017|   01|
|     3|2017|   02|
|     3|2017|   04|
|     3|2017|   05|
+------+----+-----+

I want to keep only the rows whose uid occurs more than 2 times. Expected result:

+------+----+-----+
|   uid|year|month|
+------+----+-----+
|     3|2017|   02|
|     3|2017|   04|
|     3|2017|   05|
+------+----+-----+

How can I get this result in Scala? My solution:

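import org.apache.spark.sql.functions.count

// uids that occur more than twice
val condition_uid = A_df.groupBy("uid")
  .agg(count("*").alias("cnt"))
  .filter("cnt > 2").select("uid")
// Join back to keep only the rows for those uids
val results_df = A_df.join(condition_uid, Seq("uid"))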

Is there a better way?

Answer 1

I think a window function is the perfect solution here, because you don't have to join the DataFrame back to itself.

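import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

// No orderBy: the frame covers the whole partition, so count is the total per uid
val window = Window.partitionBy("uid")

A_df.withColumn("count", count("uid").over(window))
  .filter($"count" > 2).drop("count").show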

Output:

+---+----+-----+
|uid|year|month|
+---+----+-----+
|  3|2017|   02|
|  3|2017|   04|
|  3|2017|   05|
+---+----+-----+
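
One detail worth checking when using this approach: if the window spec includes an orderBy, Spark's default frame runs from the start of the partition to the current row, so count becomes a running count rather than a per-uid total. The sketch below rebuilds the question's sample data and compares the two. It assumes a local SparkSession (for example in spark-shell); the names wholePartition, runningWindow, totalResult, and runningResult are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

// Assumption: a local session for experimentation; in spark-shell this
// simply returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("uid-count-filter")
  .getOrCreate()
import spark.implicits._

// Sample data copied from the question (month kept as a string to preserve "03").
val A_df = Seq(
  (1, 2017, "03"), (1, 2017, "05"),
  (2, 2017, "01"),
  (3, 2017, "02"), (3, 2017, "04"), (3, 2017, "05")
).toDF("uid", "year", "month")

// No ordering: the frame is the whole partition, so cnt is the total per uid.
val wholePartition = Window.partitionBy("uid")
val totalResult = A_df
  .withColumn("cnt", count("uid").over(wholePartition))
  .filter(col("cnt") > 2)
  .drop("cnt")

// With ordering: the default frame ends at the current row, so cnt is a running count.
val runningWindow = Window.partitionBy("uid").orderBy("month")
val runningResult = A_df
  .withColumn("cnt", count("uid").over(runningWindow))
  .filter(col("cnt") > 2)
  .drop("cnt")

totalResult.show()   // all three rows of uid 3
runningResult.show() // only the last row of uid 3 (month 05)
```

Ordering by a column whose values are all equal within a partition (such as year in this sample) happens to give the full count, because rows with equal ordering values fall into the same frame, but that is not something to rely on once the data spans several years.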
