I have the following DataFrame df in Spark:
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 571936|62800|  1|
| 571936|62800|  1|
| 571936|62802|  3|
| 661455|72800|  1|
| 661455|72801|  1|
+-------+-----+---+
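For reproduction, the sample data can be created in spark-shell like this (a minimal sketch; integer column types are assumed):

// spark-shell pre-imports spark.implicits._, which provides toDF on Seq
val df = Seq(
  (571936, 62800, 1),
  (571936, 62800, 1),
  (571936, 62802, 3),
  (661455, 72800, 1),
  (661455, 72801, 1)
).toDF("OrderID", "Type", "Qty")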
I need to select, for each unique OrderID, the row with the maximum Qty value, or, if all the Qty values are the same (as for 661455), the last row for that OrderID. Expected result:
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 571936|62802|  3|
| 661455|72801|  1|
+-------+-----+---+
Does anyone know how to achieve this?
Here is what I tried:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col
val partitionWindow = Window.partitionBy(col("OrderID")).orderBy(col("Qty").asc)
val result = df.over(partitionWindow) // does not compile: over is a Column method, not a DataFrame method
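For reference, over belongs to Column, not to DataFrame, so the window has to be applied inside a column expression. A minimal compiling sketch (it only annotates each row with the per-order maximum; it does not yet select rows):

import org.apache.spark.sql.functions.max

// Compiles, but keeps every row; each row just gains the maximum Qty of its OrderID
val annotated = df.withColumn("maxQty", max(col("Qty")).over(partitionWindow))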
Answer:

Number the rows within each OrderID partition in ascending Qty order with row_number, then keep the row whose number equals the per-partition maximum row number:
scala> import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions._
scala> val w = Window.partitionBy("OrderID").orderBy("Qty")
scala> val w1 = Window.partitionBy("OrderID")
scala> df.show()
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 571936|62800| 1|
| 571936|62800| 1|
| 571936|62802| 3|
| 661455|72800| 1|
| 661455|72801| 1|
+-------+-----+---+
scala> df.withColumn("rn", row_number.over(w)).withColumn("mxrn", max("rn").over(w1)).filter($"mxrn" === $"rn").drop("mxrn","rn").show
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 661455|72801| 1|
| 571936|62802| 3|
+-------+-----+---+
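A single-window variant of the same idea: rank each OrderID partition by Qty descending and keep the top row. Because row_number is non-deterministic for ties (the two Qty = 1 rows of order 661455), a tie-breaker is needed to reliably return the expected 72801 row; ordering by Type descending is assumed here as a stand-in for "last row":

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Highest Qty first; Type desc is an assumed tie-breaker for equal Qty values
val w2 = Window.partitionBy("OrderID").orderBy(col("Qty").desc, col("Type").desc)

val result = df
  .withColumn("rn", row_number.over(w2))
  .filter(col("rn") === 1)   // keep only the top-ranked row per OrderID
  .drop("rn")

result.show()

Whether Type is the right tie-breaker depends on the data; without one, Spark makes no guarantee about which of the tied rows is returned.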