How to deduplicate data in Presto

Time: 2018-08-01 09:46:13

Tags: sql presto

I have a Presto table with the columns [id, name, update_time] and the following data:

(1, Amy, 2018-08-01),
(1, Amy, 2018-08-02),
(1, Amyyyyyyy, 2018-08-03),
(2, Bob, 2018-08-01)

I want to run a SQL query that keeps only the most recent row for each id, so the result would be:

(1, Amyyyyyyy, 2018-08-03),
(2, Bob, 2018-08-01)

So far, the best way I have found to deduplicate in Presto is the following:

select 
    t1.id, 
    t1.name,
    t1.update_time 
from table_name t1
join (select id, max(update_time) as update_time from table_name group by id) t2
    on t1.id = t2.id and t1.update_time = t2.update_time

For more background, see for example "deduplication in sql".

Is there a better way to deduplicate in Presto?

5 answers:

Answer 0 (score: 2)

In PrestoDB, I tend to use row_number():

select id, name, update_time
from (select t.*,
             -- number the rows of each id, newest update_time first
             row_number() over (partition by id order by update_time desc) as seqnum
      from table_name t
     ) t
where seqnum = 1;
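To try the row_number() approach without an existing table, here is a minimal sketch that inlines the question's sample rows with a VALUES clause. The aliases t and ranked and the DATE typing of update_time are assumptions made for illustration; the real column may be a timestamp or a string.

```sql
-- Sample rows from the question, inlined so the query is self-contained.
-- Assumption: update_time is a DATE; adjust the literals if it is not.
SELECT id, name, update_time
FROM (
    SELECT t.*,
           -- seqnum = 1 marks the newest row within each id
           row_number() OVER (PARTITION BY id ORDER BY update_time DESC) AS seqnum
    FROM (
        VALUES
            (1, 'Amy',       DATE '2018-08-01'),
            (1, 'Amy',       DATE '2018-08-02'),
            (1, 'Amyyyyyyy', DATE '2018-08-03'),
            (2, 'Bob',       DATE '2018-08-01')
    ) AS t (id, name, update_time)
) AS ranked
WHERE seqnum = 1;
-- Returns (1, 'Amyyyyyyy', 2018-08-03) and (2, 'Bob', 2018-08-01), in no guaranteed order.
```

Because row_number() assigns distinct numbers even when two rows of one id share the same update_time, this form keeps exactly one row per id, whereas a join on max(update_time) would keep every tied row.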

Answer 1 (score: 1)

You seem to want a subquery.
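One shape a subquery-based query could take is a correlated scalar subquery that looks up the latest update_time for each row's id. This is only a sketch under assumptions: the answer does not show its own query, the alias t2 and the inlined sample data are illustrative, and some older Presto releases may reject correlated subqueries.

```sql
-- Hypothetical stand-in for table_name, built from the question's sample rows.
WITH table_name AS (
    SELECT *
    FROM (
        VALUES
            (1, 'Amy',       DATE '2018-08-01'),
            (1, 'Amy',       DATE '2018-08-02'),
            (1, 'Amyyyyyyy', DATE '2018-08-03'),
            (2, 'Bob',       DATE '2018-08-01')
    ) AS v (id, name, update_time)
)
SELECT t.id, t.name, t.update_time
FROM table_name t
WHERE t.update_time = (
    -- latest update_time among the rows that share this row's id
    SELECT max(t2.update_time)
    FROM table_name t2
    WHERE t2.id = t.id
);
```

Like the join in the question, this keeps every row that ties for the latest update_time within an id.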

Answer 2 (score: 1)

Here is another way:

WITH latest AS (SELECT id, max(update_time) AS latest_time FROM table_name GROUP BY id)
    SELECT t.id, t.name, t.update_time FROM table_name t INNER JOIN latest l ON t.id = l.id AND t.update_time = l.latest_time

Answer 3 (score: 0)

It's simple:

Select id, max_by(name, update_time) as name, MAX(update_time) as "Last Update" from table_name Group by id

Hope this helps.

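Answer 4 (score: 0)

Just use the in operator:

 select t.*
    from table_name t
    -- Caveat: the subquery is not correlated on id, so a row is kept
    -- whenever its update_time equals the maximum of any id.
    where update_time in (select MAX(table_name.update_time) from table_name group by id)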