I have the following DataFrame df:
customer_id product_id timestamp action
111 1 1519030817 add
111 1 1519030917 remove
111 2 1519030819 add
222 2 1519030819 add
I want to group the records by customer_id and product_id, and then take the latest action in each group.
This is what I did:
df.groupBy("customer_id","product_id").orderBy(desc("timestamp"))
But how can I actually get the latest action?
The result should look like this:
customer_id product_id timestamp action
111 1 1519030917 remove
111 2 1519030819 add
222 2 1519030819 add
Answer 0 (score: 0)
You can use a Window function, as shown below:
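import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, row_number}
// Assumes spark.implicits._ is in scope for the $"..." syntax (automatic in spark-shell).
val w = Window.partitionBy("customer_id", "product_id")
  .orderBy(desc("timestamp"), desc("action"))
df.withColumn("rn", row_number().over(w))
  .where($"rn" === 1).drop("rn").show(false)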
Output:
+-----------+----------+----------+------+
|customer_id|product_id|timestamp |action|
+-----------+----------+----------+------+
|111 |2 |1519030819|add |
|222 |2 |1519030819|add |
|111 |1 |1519030917|remove|
+-----------+----------+----------+------+
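
To make it clearer what the window is doing, here is a rough sketch of the same logic on plain Scala collections, without Spark. The Event case class and its field names are hypothetical stand-ins for the columns of df, and the snippet is meant to be run as a script (for example in the Scala REPL). Grouping plays the role of partitionBy, and picking the maximum of (timestamp, action) corresponds to ordering by both columns descending and keeping the row where rn is 1.

```scala
// Hypothetical record type mirroring the columns of df.
case class Event(customerId: Int, productId: Int, timestamp: Long, action: String)

val events = Seq(
  Event(111, 1, 1519030817L, "add"),
  Event(111, 1, 1519030917L, "remove"),
  Event(111, 2, 1519030819L, "add"),
  Event(222, 2, 1519030819L, "add")
)

// Group rows by (customer_id, product_id), like Window.partitionBy.
// Within each group keep the row with the largest (timestamp, action),
// which is the row that would get row_number 1 under the descending ordering.
val latest = events
  .groupBy(e => (e.customerId, e.productId))
  .values
  .map(_.maxBy(e => (e.timestamp, e.action)))
  .toSeq
  .sortBy(e => (e.customerId, e.productId)) // sorted only to make the printout stable

latest.foreach(println)
```

Including action in the comparison only matters when two rows in the same group share a timestamp. Like desc("action") in the window, it makes the choice deterministic rather than meaningful.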