Question

I have a use case where I need to drop duplicate rows from a dataframe (duplicate here means they have the same 'id' field) while keeping the row with the highest 'timestamp' (unix timestamp) field.

I found the drop_duplicates method (I'm using pyspark), but it gives no control over which row is kept.

Can anyone help? Thanks in advance.

Answer 1

You may need a manual map and reduce to get the behavior you want.

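def selectRowByTimeStamp(x, y):
    # Reducer: of two rows with the same id, keep the one with the later timestamp.
    if x.timestamp > y.timestamp:
        return x
    return y

# Key each row by its id. Going through .rdd is needed on Spark 2.0+,
# where DataFrame no longer exposes map() directly.
dataMap = data.rdd.map(lambda x: (x.id, x))
# Collapse each key to a single row; the result is an RDD of (id, row) pairs.
uniqueData = dataMap.reduceByKey(selectRowByTimeStamp)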

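Here we group all of the data by id. Then, when reducing each group, we keep the record with the highest timestamp. When the reduce finishes, only one record is left for each id.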
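To see what the reduce step does without a Spark cluster, here is a minimal plain-Python sketch of the same idea. The Row namedtuple, the sample rows, and the helper latest_per_id are hypothetical names made up for illustration; a dictionary stands in for reduceByKey, so this only mirrors the logic and is not Spark code.

```python
from collections import namedtuple

# Hypothetical row type and sample data, for illustration only.
Row = namedtuple("Row", ["id", "timestamp", "data"])

def select_row_by_timestamp(x, y):
    # Same rule as the reducer above: keep the row with the larger timestamp.
    return x if x.timestamp > y.timestamp else y

def latest_per_id(rows):
    # Local stand-in for map + reduceByKey: fold rows into one winner per id.
    winners = {}
    for row in rows:
        if row.id in winners:
            winners[row.id] = select_row_by_timestamp(winners[row.id], row)
        else:
            winners[row.id] = row
    # Sort so the output order is deterministic (by id, then timestamp).
    return sorted(winners.values())

rows = [
    Row(1, 12345678, "this is a test"),
    Row(1, 23456789, "another test"),
    Row(2, 2345678, "2nd test"),
    Row(2, 1234567, "2nd another test"),
]
result = latest_per_id(rows)
```

Note that when two rows share both id and timestamp, the reducer returns y, so which of the tied rows survives depends on the order in which they are combined.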

Answer 2

You can do something like this:

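// Outside spark-shell these imports are needed; `spark` is the active SparkSession.
import spark.implicits._
import org.apache.spark.sql.functions.max

val df = Seq(
  (1, 12345678, "this is a test"),
  (1, 23456789, "another test"),
  (2, 2345678, "2nd test"),
  (2, 1234567, "2nd another test")
).toDF("id", "timestamp", "data")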

+---+---------+----------------+
| id|timestamp|            data|
+---+---------+----------------+
|  1| 12345678|  this is a test|
|  1| 23456789|    another test|
|  2|  2345678|        2nd test|
|  2|  1234567|2nd another test|
+---+---------+----------------+

df.join(
  df.groupBy($"id").agg(max($"timestamp") as "r_timestamp").withColumnRenamed("id", "r_id"),
  $"id" === $"r_id" && $"timestamp" === $"r_timestamp"
).drop("r_id").drop("r_timestamp").show
+---+---------+------------+
| id|timestamp|        data|
+---+---------+------------+
|  1| 23456789|another test|
|  2|  2345678|    2nd test|
+---+---------+------------+

If you expect that a timestamp may be repeated for the same id (see the comments below), you can do this:

df.dropDuplicates(Seq("id", "timestamp")).join(
  df.groupBy($"id").agg(max($"timestamp") as "r_timestamp").withColumnRenamed("id", "r_id"),
  $"id" === $"r_id" && $"timestamp" === $"r_timestamp"
).drop("r_id").drop("r_timestamp").show
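The reason for the extra dropDuplicates can be sketched in plain Python. This is a local simulation under assumptions, not Spark code: Row, join_on_max, drop_duplicates, and the sample rows are hypothetical names invented here. It shows that joining back on the per-id maximum returns every row tied at that maximum, and that removing duplicates on (id, timestamp) first leaves one row per id.

```python
from collections import namedtuple

# Hypothetical row type, for illustration only.
Row = namedtuple("Row", ["id", "timestamp", "data"])

def join_on_max(rows):
    # Per-id maximum timestamp, mirroring groupBy("id").agg(max("timestamp")).
    max_ts = {}
    for r in rows:
        max_ts[r.id] = max(r.timestamp, max_ts.get(r.id, r.timestamp))
    # Inner join back: keep every row whose timestamp equals its id's maximum.
    return [r for r in rows if r.timestamp == max_ts[r.id]]

def drop_duplicates(rows, key):
    # Keeps the first row seen per key. Spark's dropDuplicates does not
    # promise which of the duplicate rows it keeps.
    seen = {}
    for r in rows:
        seen.setdefault(key(r), r)
    return list(seen.values())

rows = [
    Row(1, 100, "a"),
    Row(1, 200, "b"),
    Row(1, 200, "c"),  # ties with "b" on (id, timestamp)
    Row(2, 50, "d"),
]

with_ties = join_on_max(rows)
deduped = join_on_max(drop_duplicates(rows, key=lambda r: (r.id, r.timestamp)))
```

Here with_ties still holds two rows for id 1, while deduped holds one row per id. If it matters which of the tied rows is kept, a deterministic tie-breaker on another column would be needed.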

spark: How to do a dropDuplicates on a dataframe while keeping the row with the highest timestamp
