How to select rows using a window function?

Asked: 2019-06-20 20:12:11

Tags: scala apache-spark apache-spark-sql

I have the following DataFrame df in Spark:

+------------+---------+-----------+
|OrderID     |     Type|        Qty|
+------------+---------+-----------+
|      571936|    62800|          1|
|      571936|    62800|          1|
|      571936|    62802|          3|
|      661455|    72800|          1|
|      661455|    72801|          1|
+------------+---------+-----------+

For each unique OrderID I need to select the row with the largest Qty, or, if all Qty values are equal, the last row (as for 661455). Expected result:

+------------+---------+-----------+
|OrderID     |     Type|        Qty|
+------------+---------+-----------+
|      571936|    62802|          3|
|      661455|    72801|          1|
+------------+---------+-----------+

Does anyone have an idea how to get this?

Here is what I have tried:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col

val partitionWindow = Window.partitionBy(col("OrderID")).orderBy(col("Qty").asc)
val result = df.over(partitionWindow) // does not compile: over is defined on Column, not on DataFrame

1 Answer:

Answer 0 (score: 0)
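
The idea behind the code below: rank the rows within each OrderID by ascending Qty with row_number, compute the maximum row number per OrderID, and keep only the rows where the two match, i.e. the last row of each partition in Qty order.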

scala> import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions._
scala> val w = Window.partitionBy("OrderID").orderBy("Qty")  // ranks rows within each OrderID by ascending Qty
scala> val w1 = Window.partitionBy("OrderID")                // whole partition, used to find the max row number

scala> df.show()
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 571936|62800|  1|
| 571936|62800|  1|
| 571936|62802|  3|
| 661455|72800|  1|
| 661455|72801|  1|
+-------+-----+---+


scala> df.withColumn("rn",  row_number.over(w)).withColumn("mxrn", max("rn").over(w1)).filter($"mxrn" === $"rn").drop("mxrn","rn").show
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 661455|72801|  1|
| 571936|62802|  3|
+-------+-----+---+
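
One caveat with this answer: when several rows tie on Qty (as for OrderID 661455), row_number assigns their ranks in whatever order the rows happen to arrive, so Spark does not guarantee that the original last row wins. A minimal sketch of a variant with an explicit tie-breaker follows; it reuses df from the question, and the pos column and the use of monotonically_increasing_id as a stand-in for the original row order are assumptions of mine, not part of the original answer:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// remember the arrival order so that ties on Qty resolve to the last row
val withPos = df.withColumn("pos", monotonically_increasing_id())

// rank 1 = largest Qty, or the latest of the tied rows when all Qty values are equal
val w2 = Window.partitionBy("OrderID").orderBy(col("Qty").desc, col("pos").desc)

withPos
  .withColumn("rn", row_number().over(w2))
  .filter(col("rn") === 1)
  .drop("rn", "pos")
  .show()

monotonically_increasing_id grows with the record position inside each input partition, so on a small, freshly created DataFrame it tracks the input order; for production data an explicit ordering column such as a timestamp would be a safer tie-breaker.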