Question

I have one million rows that contain duplicate records. For each key, I want to keep only the record with the latest timestamp.

df.sort($"event_timestamp".desc).dropDuplicates("uniquekey")

Here is the formatted execution plan:

== Physical Plan ==
      :     :     :     :- Project [CreatedAt#49148,]
      :     :     :     :  +- SortAggregate(key=[unique_key#42606], functions=[first(CreatedAt#42605, false),,... 71 more fields], output=[])
      :     :     :     :     +- Sort [unique_key#42606 ASC], false, 0
      :     :     :     :        +- Exchange hashpartitioning(unique_key#42606, 20)
      :     :     :     :           +- SortAggregate(key=[unique_key#42606], functions=[partial_first(CreatedAt#42605, false).. ], output=[])
      :     :     :     :              +- *Sort [unique_key#42606 ASC], false, 0
      :     :     :     :                 +- *Sort [unique_key#42606 DESC, event_timestamp#48819 DESC], true, 0
      :     :     :     :                    +- Exchange rangepartitioning( event_timestamp#48819 DESC, 20)
      :     :     :     :                       +- *Project []
      :     :     :     :                          +- InMemoryTableScan [CreatedAt#42539, Data#42540, Dataset#42541, _id#42542]
      :     :     :     :                             :  +- InMemoryRelation [CreatedAt#42539, Data#42540, Dataset#42541, _id#42542], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
      :     :     :     :                             :     :  +- *Filter ()
      :     :     :     :                             :     :     +- *Scan MongoRelation(MongoRDD[565] at RDD at MongoRDD.scala:52,Some(StructType(...))

I get the wrong output: the rows that survive deduplication do not always have the latest timestamp.

Is there another way to deduplicate correctly?
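
One alternative that is often suggested for this kind of problem is to rank rows within each key using a window function and keep only the top-ranked row. The plan above shows dropDuplicates being executed as a first() aggregate after a hash exchange on the key, so the earlier sort is not guaranteed to decide which row first() picks. The sketch below is a minimal illustration and not a confirmed fix for this dataset: the sample data, the payload column, and the names DedupLatest and latestPerKey are hypothetical, and it assumes the key column is called unique_key as in the plan.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object DedupLatest {

  // Hypothetical helper: keeps one row per unique_key, the one with the
  // largest event_timestamp. Column names are assumed from the plan above.
  def latestPerKey(df: DataFrame): DataFrame = {
    // Rank rows inside each key, newest first.
    val newestFirst = Window
      .partitionBy(col("unique_key"))
      .orderBy(col("event_timestamp").desc)

    df.withColumn("rn", row_number().over(newestFirst))
      .filter(col("rn") === 1) // rank 1 is the newest row of each key
      .drop("rn")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dedup-latest")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data, for illustration only.
    val df = Seq(
      ("a", 1L, "old"),
      ("a", 3L, "new"),
      ("b", 2L, "only")
    ).toDF("unique_key", "event_timestamp", "payload")

    latestPerKey(df).show()
    spark.stop()
  }
}
```

If two rows of the same key share the same timestamp, row_number() picks one of them arbitrarily; adding a second column to the orderBy would make the choice deterministic.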

Problem deduplicating Spark rows with a sort condition

0 answers: