Spark Structured Streaming dropDuplicates keep last

Time: 2018-12-24 09:36:57

Tags: apache-spark spark-streaming

I want to maintain a streaming DataFrame that receives "updates".

For that I use dropDuplicates.

But dropDuplicates discards the latest change.

How can I keep only the last one?
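
To make it concrete, here is a minimal sketch of what I mean (the socket source and the id/amt columns are only placeholders, not my real job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("dedupSketch").getOrCreate()
import spark.implicits._

// hypothetical stream of updates, each input line looks like "id,amt"
val updates = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()
  .selectExpr("cast(split(value, ',')[0] as int) as id",
              "cast(split(value, ',')[1] as double) as amt")

// dropDuplicates keeps the FIRST row seen per id and drops later updates
val deduped = updates.dropDuplicates("id")

deduped.writeStream.format("console").outputMode("append").start()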

1 answer:

Answer 0 (score: 0)

Assuming you want to pick the last record per id while dropping the other duplicates, you can use window functions and filter on row_number = count. Check this out:

scala> val df = Seq((120,34.56,"2018-10-11"),(120,65.73,"2018-10-14"),(120,39.96,"2018-10-20"),(122,11.56,"2018-11-20"),(122,24.56,"2018-10-20")).toDF("id","amt","dt")
df: org.apache.spark.sql.DataFrame = [id: int, amt: double ... 1 more field]

scala> val df2=df.withColumn("dt",'dt.cast("date"))
df2: org.apache.spark.sql.DataFrame = [id: int, amt: double ... 1 more field]

scala> df2.show(false)
+---+-----+----------+
|id |amt  |dt        |
+---+-----+----------+
|120|34.56|2018-10-11|
|120|65.73|2018-10-14|
|120|39.96|2018-10-20|
|122|11.56|2018-11-20|
|122|24.56|2018-10-20|
+---+-----+----------+


scala> df2.createOrReplaceTempView("ido")

scala> spark.sql(""" select id,amt,dt,row_number() over(partition by id order by dt) rw, count(*) over(partition by id) cw from ido """).show(false)
+---+-----+----------+---+---+
|id |amt  |dt        |rw |cw |
+---+-----+----------+---+---+
|122|24.56|2018-10-20|1  |2  |
|122|11.56|2018-11-20|2  |2  |
|120|34.56|2018-10-11|1  |3  |
|120|65.73|2018-10-14|2  |3  |
|120|39.96|2018-10-20|3  |3  |
+---+-----+----------+---+---+


scala> spark.sql(""" select id,amt,dt from (select id,amt,dt,row_number() over(partition by id order by dt) rw, count(*) over(partition by id) cw from ido) where rw=cw """).show(false)
+---+-----+----------+
|id |amt  |dt        |
+---+-----+----------+
|122|11.56|2018-11-20|
|120|39.96|2018-10-20|
+---+-----+----------+


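The same logic can also be written with the DataFrame API instead of SQL; this is only a sketch against the df2 defined above, not part of the original answer:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, count, lit, col}

val byIdOrderedByDt = Window.partitionBy("id").orderBy("dt")
val byId            = Window.partitionBy("id")

val latest = df2
  .withColumn("rw", row_number().over(byIdOrderedByDt))   // 1..n within each id, oldest first
  .withColumn("cw", count(lit(1)).over(byId))             // total rows for that id
  .filter(col("rw") === col("cw"))                        // keep only the last row per id
  .drop("rw", "cw")

latest.show(false)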

If you want dt sorted in descending order, just put "order by dt desc" in the over() clause, as sketched below. Does this help?
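
With the descending order you can simply keep the first row per id instead of comparing against the count; a sketch against the same ido view (the result should match the rw = cw version above):

spark.sql(""" select id, amt, dt
              from (select id, amt, dt,
                           row_number() over (partition by id order by dt desc) rw
                    from ido)
              where rw = 1 """).show(false)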