How can I use the collect_set and collect_list functions in a windowed aggregation in Spark 1.6?

Asked: 2017-07-16 17:27:37

Tags: scala apache-spark apache-spark-sql apache-spark-1.6

In Spark 1.6.0 / Scala, is there any way to get collect_list("colC") or collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))?

2 answers:

Answer 0 (score: 20)

Given that you have the following dataframe:

+----+----+----+
|colA|colB|colC|
+----+----+----+
|1   |1   |23  |
|1   |2   |63  |
|1   |3   |31  |
|2   |1   |32  |
|2   |2   |56  |
+----+----+----+
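(For reference, a minimal sketch of how such a dataframe could be built; the names sc and sqlContext are my own, and a HiveContext is assumed because in Spark 1.6 window functions and collect_list/collect_set are generally only available with Hive support:)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("window-collect").setMaster("local[*]"))
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

val df = Seq((1, 1, 23), (1, 2, 63), (1, 3, 31), (2, 1, 32), (2, 2, 56))
  .toDF("colA", "colB", "colC")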

You can apply a window function as follows:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

// With orderBy, the window's default frame runs from the start of the
// partition to the current row, so collect_list yields a running list.
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)

Result:

+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23]        |
|1   |2   |63  |[23, 63]    |
|1   |3   |31  |[23, 63, 31]|
|2   |1   |32  |[32]        |
|2   |2   |56  |[32, 56]    |
+----+----+----+------------+
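Note that each row gets a running (cumulative) list. If you only need one complete list per group rather than one per row, a plain groupBy/agg is simpler than a window (a sketch of mine; bear in mind that element order after a groupBy is not guaranteed):

// One aggregated list per colA group, no window needed
df.groupBy("colA")
  .agg(collect_list("colC").as("colD"))
  .show(false)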

The same works for collect_set, but the order of elements in the resulting set is not guaranteed to match the order collect_list produces:

df.withColumn("colD", collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23]        |
|1   |2   |63  |[63, 23]    |
|1   |3   |31  |[63, 31, 23]|
|2   |1   |32  |[32]        |
|2   |2   |56  |[56, 32]    |
+----+----+----+------------+
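If you need distinct values but with first-seen order preserved, one option (my own sketch, not part of the original answer) is to collect_list over the ordered window and de-duplicate with a small UDF, since Scala's Seq.distinct keeps the first occurrence of each element:

// dedupUdf is a hypothetical helper; it assumes colC is an integer column
val dedupUdf = udf((xs: Seq[Int]) => xs.distinct)

df.withColumn("colD",
    dedupUdf(collect_list("colC").over(Window.partitionBy("colA").orderBy("colB"))))
  .show(false)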

If you remove the orderBy, as below,

df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA"))).show(false)

the result will be:

+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23, 63, 31]|
|1   |2   |63  |[23, 63, 31]|
|1   |3   |31  |[23, 63, 31]|
|2   |1   |32  |[32, 56]    |
|2   |2   |56  |[32, 56]    |
+----+----+----+------------+
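The difference comes from the default window frame: with an orderBy the frame runs from the start of the partition to the current row, whereas without it the frame covers the whole partition. If you want both the ordering and the full list on every row, you can spell the frame out explicitly (a sketch; in 1.6 an unbounded frame is expressed with Long.MinValue / Long.MaxValue):

val fullWindow = Window.partitionBy("colA").orderBy("colB")
  .rowsBetween(Long.MinValue, Long.MaxValue) // whole partition, still ordered by colB

df.withColumn("colD", collect_list("colC").over(fullWindow)).show(false)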

I hope the answer is helpful.

Answer 1 (score: 0)

The existing answer is valid; just adding here a different style of writing the window functions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import sqlContext.implicits._ // import from your SQLContext/HiveContext instance to enable the $"..." syntax

// orderBy must take either all column names or all Columns, so the
// descending key $"colB2".desc requires the other keys to be Columns too.
val wind_user = Window.partitionBy("colA", "colA2").orderBy($"colB", $"colB2".desc)

df.withColumn("colD_distinct", collect_set($"colC") over wind_user)
  .withColumn("colD_historical", collect_list($"colC") over wind_user)
  .show(false)