I want to maintain a streaming DataFrame that receives "updates". For that I planned to use dropDuplicates, but dropDuplicates discards the latest changes. How can I keep only the last one?
Answer 0 (score: 0)
Assuming you need to pick the last record per id while dropping the other duplicates, you can use window functions and filter on row_number = count. Check this out:
scala> val df = Seq((120,34.56,"2018-10-11"),(120,65.73,"2018-10-14"),(120,39.96,"2018-10-20"),(122,11.56,"2018-11-20"),(122,24.56,"2018-10-20")).toDF("id","amt","dt")
df: org.apache.spark.sql.DataFrame = [id: int, amt: double ... 1 more field]
scala> val df2=df.withColumn("dt",'dt.cast("date"))
df2: org.apache.spark.sql.DataFrame = [id: int, amt: double ... 1 more field]
scala> df2.show(false)
+---+-----+----------+
|id |amt |dt |
+---+-----+----------+
|120|34.56|2018-10-11|
|120|65.73|2018-10-14|
|120|39.96|2018-10-20|
|122|11.56|2018-11-20|
|122|24.56|2018-10-20|
+---+-----+----------+
scala> df2.createOrReplaceTempView("ido")
scala> spark.sql(""" select id,amt,dt,row_number() over(partition by id order by dt) rw, count(*) over(partition by id) cw from ido """).show(false)
+---+-----+----------+---+---+
|id |amt |dt |rw |cw |
+---+-----+----------+---+---+
|122|24.56|2018-10-20|1 |2 |
|122|11.56|2018-11-20|2 |2 |
|120|34.56|2018-10-11|1 |3 |
|120|65.73|2018-10-14|2 |3 |
|120|39.96|2018-10-20|3 |3 |
+---+-----+----------+---+---+
scala> spark.sql(""" select id,amt,dt from (select id,amt,dt,row_number() over(partition by id order by dt) rw, count(*) over(partition by id) cw from ido) where rw=cw """).show(false)
+---+-----+----------+
|id |amt |dt |
+---+-----+----------+
|122|11.56|2018-11-20|
|120|39.96|2018-10-20|
+---+-----+----------+
If you want dt sorted in descending order instead, just put "order by dt desc" in the over() clause. Does this help?
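
If you prefer the DataFrame API over Spark SQL, here is a minimal sketch of the same dedup, assuming the df2 from the transcript above. With the descending order, keeping row_number = 1 replaces the row_number = count comparison:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Order each id partition by dt descending, so the newest record gets row 1.
val w = Window.partitionBy("id").orderBy(col("dt").desc)

df2.withColumn("rw", row_number().over(w))
  .filter(col("rw") === 1)   // keep only the latest row per id
  .drop("rw")
  .show(false)               // should match the rw = cw result above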