How to select each user's last action

Time: 2018-02-20 15:16:41

Tags: scala apache-spark spark-dataframe

I have the following DataFrame df:

customer_id   product_id   timestamp   action
111           1            1519030817  add
111           1            1519030917  remove
111           2            1519030819  add
222           2            1519030819  add

I want to group the records by customer_id and product_id, and then take the last action in each group.

This is what I have tried so far:

df.groupBy("customer_id","product_id").orderBy(desc("timestamp"))

But how can I actually take the latest action?

The result should look like this:

customer_id   product_id   timestamp   action
111           1            1519030917  remove
111           2            1519030819  add
222           2            1519030819  add

1 answer:

Answer 0 (score: 0)

You can use a Window function, as shown below:

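  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{col, desc, row_number}

  // Newest row first within each (customer_id, product_id) group; action breaks timestamp ties.
  val w = Window.partitionBy("customer_id", "product_id")
          .orderBy(desc("timestamp"), desc("action"))

  // Keep only the first-ranked (latest) row of each group.
  df.withColumn("rn", row_number().over(w))
          .where(col("rn") === 1).drop("rn").show(false)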

Output:

+-----------+----------+----------+------+
|customer_id|product_id|timestamp |action|
+-----------+----------+----------+------+
|111        |2         |1519030819|add   |
|222        |2         |1519030819|add   |
|111        |1         |1519030917|remove|
+-----------+----------+----------+------+
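
For comparison, the same result can probably also be reached without a window, by aggregating on a struct. The sketch below is not part of the original answer: the object name LatestAction, the helper latestActions, and the local SparkSession settings are hypothetical choices made for illustration. It relies on Spark comparing structs field by field, so max(struct(timestamp, action)) picks the largest timestamp per group and, on equal timestamps, the alphabetically greatest action, which mirrors the desc("timestamp"), desc("action") ordering used above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, struct}

// Hypothetical wrapper object, used only to make the sketch self-contained.
object LatestAction {

  // For each (customer_id, product_id) pair, keep the row with the largest timestamp.
  // Structs are compared field by field, so the timestamp decides first and the
  // action column only breaks ties between rows with the same timestamp.
  def latestActions(df: DataFrame): DataFrame =
    df.groupBy("customer_id", "product_id")
      .agg(max(struct(col("timestamp"), col("action"))).as("latest"))
      .select(
        col("customer_id"),
        col("product_id"),
        col("latest.timestamp").as("timestamp"),
        col("latest.action").as("action"))

  def main(args: Array[String]): Unit = {
    // Assumed local session; in spark-shell the existing `spark` value would be used instead.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("latest-action")
      .getOrCreate()
    import spark.implicits._

    // Sample data copied from the question.
    val df = Seq(
      (111, 1, 1519030817L, "add"),
      (111, 1, 1519030917L, "remove"),
      (111, 2, 1519030819L, "add"),
      (222, 2, 1519030819L, "add")
    ).toDF("customer_id", "product_id", "timestamp", "action")

    latestActions(df).show(false)
    spark.stop()
  }
}
```

This variant only carries over the columns packed into the struct, so a DataFrame with more columns would need them added to the struct (after timestamp) or would be better served by the window approach, which keeps whole rows.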