Spark DataFrame - Removing rows with logic

Date: 2018-04-09 22:35:49

Tags: scala apache-spark functional-programming spark-dataframe

I need help with the following case:

I have the following DataFrame (shown as a screenshot in the original post):

I need to remove the rows that are duplicated on CNTA_TIPODOCUMENTOS and CNTA_NRODOCUMENTO, keeping the one with the latest CNTA_FECHA_FORMULARIO. For example, for CNTA_NRODOCUMENTO 35468731 I should get this row:

|                  1|         35468731| 2012-08-25 00:00:...|              MARIA| 

Do you have any ideas? Thanks.

1 Answer:

Answer 0 (score: 1)

One way is to use the window function row_number to rank the dates in descending order within the appropriate partitions and select the first row of each partition:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// `toDF` and the $-column syntax require the session implicits
// (already in scope in spark-shell):
// import spark.implicits._

val df = Seq(
  (1, 80025709, "2010-07-19 00:00:00", "JUAN"),
  (1, 35468731, "2010-07-28 00:00:00", "PEDRO"),
  (1, 51714038, "2010-08-02 00:00:00", "ALEX"),
  (1, 35468731, "2011-09-28 00:00:00", "KAREN"),
  (1, 35468731, "2012-08-25 00:00:00", "MARIA")
).toDF("c1", "c2", "date", "name")

df.withColumn(
    "rownum",
    row_number.over(Window.partitionBy($"c1", $"c2").orderBy($"date".desc))
  ).
  where($"rownum" === 1).
  select($"c1", $"c2", $"date", $"name").
  show

// +---+--------+-------------------+-----+
// | c1|      c2|               date| name|
// +---+--------+-------------------+-----+
// |  1|51714038|2010-08-02 00:00:00| ALEX|
// |  1|80025709|2010-07-19 00:00:00| JUAN|
// |  1|35468731|2012-08-25 00:00:00|MARIA|
// +---+--------+-------------------+-----+
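To make the "keep the latest row per (c1, c2) key" rule concrete outside of Spark, here is a minimal sketch of the same logic over plain tuples. It is written in Python only so it runs without a Spark session, and the helper name `latest_per_key` is made up for this illustration; the actual answer above uses Spark's row_number window.

```python
# Plain sketch of "keep the latest row per (c1, c2) key" -- illustration
# only, not the Spark implementation.
rows = [
    (1, 80025709, "2010-07-19 00:00:00", "JUAN"),
    (1, 35468731, "2010-07-28 00:00:00", "PEDRO"),
    (1, 51714038, "2010-08-02 00:00:00", "ALEX"),
    (1, 35468731, "2011-09-28 00:00:00", "KAREN"),
    (1, 35468731, "2012-08-25 00:00:00", "MARIA"),
]

def latest_per_key(rows):
    """For each (c1, c2) key, keep the row with the greatest date."""
    best = {}
    for row in rows:
        key = row[:2]  # (c1, c2)
        # ISO-formatted timestamp strings compare correctly as strings,
        # so a plain > picks the later date.
        if key not in best or row[2] > best[key][2]:
            best[key] = row
    return sorted(best.values())

for row in latest_per_key(rows):
    print(row)
```

The dictionary plays the role of the window partition: each key keeps exactly one surviving row, matching the three rows in the Spark output above.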