Count by aggregation over a window

Date: 2019-03-20 16:03:54

Tags: apache-spark apache-spark-sql

I want to run a count over a window. The aggregated count should be stored in a new column:

Input DataFrame:

    val df = Seq(("N1", "M1","1"),("N1", "M1","2"),("N1", "M2","3")).toDF("NetworkID", "Station","value")

    +---------+-------+-----+
    |NetworkID|Station|value|
    +---------+-------+-----+
    |       N1|     M1|    1|
    |       N1|     M1|    2|
    |       N1|     M2|    3|
    +---------+-------+-----+

    val w = Window.partitionBy(df("NetworkID"))

Result so far:

        df.withColumn("count", count("Station").over(w)).show()
        +---------+-------+-----+-----+
        |NetworkID|Station|value|count|
        +---------+-------+-----+-----+
        |       N1|     M2|    3|    3|
        |       N1|     M1|    1|    3|
        |       N1|     M1|    2|    3|
        +---------+-------+-----+-----+

The result I want:

    +---------+-------+-----+-----+
    |NetworkID|Station|value|count|
    +---------+-------+-----+-----+
    |       N1|     M2|    3|    2|
    |       N1|     M1|    1|    2|
    |       N1|     M1|    2|    2|
    +---------+-------+-----+-----+

Because the number of Stations for NetworkID N1 is 2 (M1 and M2).

I know I could do this by creating a new DataFrame that selects the two columns NetworkID and Station, doing a groupBy, and joining the result back to the first DataFrame.
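For reference, that join-based approach might look something like the sketch below (the intermediate name stationCounts is just for illustration):

    import org.apache.spark.sql.functions.count

    // Distinct stations per NetworkID, computed separately and joined back.
    val stationCounts = df
      .select("NetworkID", "Station")
      .distinct()
      .groupBy("NetworkID")
      .agg(count("Station").as("count"))

    df.join(stationCounts, Seq("NetworkID")).show()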

But I have to run many aggregation operations on different columns of the DataFrame, so I have to avoid joins.

Thanks in advance

3 answers:

Answer 0 (score: 0)

You also need to partition on the "Station" column, since you want to count Stations for each NetworkID.

scala> val df = Seq(("N1", "M1","1"),("N1", "M1","2"),("N1", "M2","3"),("N2", "M1", "4"), ("N2", "M2", "2")).toDF("NetworkID", "Station", "value")
df: org.apache.spark.sql.DataFrame = [NetworkID: string, Station: string ... 1 more field]

scala> val w = Window.partitionBy("NetworkID", "Station")
w: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@5b481d77

scala> df.withColumn("count", count("Station").over(w)).show()
+---------+-------+-----+-----+
|NetworkID|Station|value|count|
+---------+-------+-----+-----+
|       N2|     M2|    2|    1|
|       N1|     M2|    3|    1|
|       N2|     M1|    4|    1|
|       N1|     M1|    1|    2|
|       N1|     M1|    2|    2|
+---------+-------+-----+-----+

Answer 1 (score: 0)

What you want is a distinct count of the "Station" column, which would be expressed as countDistinct("Station") instead of count("Station"). Unfortunately, that is not supported over a window yet (or is it only in my Spark version??).
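The failing attempt would look something like this (a sketch using the df and w from the question):

df.withColumn("count", countDistinct("Station").over(w)).show()
// fails at analysis time with the exception below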

org.apache.spark.sql.AnalysisException: Distinct window functions are not supported

As a workaround, you can use dense_rank both forward and backward: for any row, the ascending rank plus the descending rank minus 1 equals the number of distinct Stations in the partition.

df.withColumn("count", (dense_rank() over w.orderBy(asc("Station"))) + (dense_rank() over w.orderBy(desc("Station"))) - 1).show()

+---------+-------+-----+-----+
|NetworkID|Station|value|count|
+---------+-------+-----+-----+
|       N1|     M1|    2|    2|
|       N1|     M1|    1|    2|
|       N1|     M2|    3|    2|
+---------+-------+-----+-----+

Answer 2 (score: 0)

I know it's late, but try this:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val result = df
  // rank each distinct Station within its NetworkID
  .withColumn("dr", dense_rank().over(Window.partitionBy("NetworkID").orderBy("Station")))
  // the highest rank per NetworkID is the number of distinct Stations
  .withColumn("count", max("dr").over(Window.partitionBy("NetworkID")))