Duplicate email addresses with an ID column

Time: 2015-04-02 01:13:23

Tags: sql email duplicate-detection

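My table contains duplicate email addresses. Each row has a unique create date and a unique ID. For each email address, I want to identify the row with the most recent create date and its associated ID, and also list the duplicate IDs with their create dates. I would like the query to return the following columns: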

  • Column 1: EmailAddress
  • Column 2: IDKeep
  • Column 3: CreateDateofIDKeep
  • Column 4: DuplicateID
  • Column 5: CreateDateofDuplicateID

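Note: in some cases an email address has more than 2 duplicates. I would like the query to show each additional duplicate on a new row, repeating EmailAddress and IDKeep on those rows.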

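I have tried, to no avail, to piece together different queries found here. I'm currently at a loss - any help or direction would be greatly appreciated.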

2 answers:

Answer 0: (score: 1)

Complex queries are best solved by breaking them into parts and working through them step by step.

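First, let's write a query that finds the key of the row we want to keep, by finding the latest create date for each email and then joining back to the table to get the ID: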

select x.Email, x.CreateDate, x.Id
from myTable x
join (
    select Email, max(CreateDate) as CreateDate
    from myTable
    group by Email
) y on x.Email = y.Email and x.CreateDate = y.CreateDate
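To try this query (and the ones that follow), a small test table is handy. The sketch below is only an illustration: the column types and the sample rows are assumptions, not taken from the question, so adjust them to your actual schema.

```sql
-- Hypothetical schema and sample data, only for trying out the queries
-- in this answer. Column types are assumed; the question does not give them.
create table myTable (
    Id         integer primary key,
    Email      varchar(255) not null,
    CreateDate date not null
);

insert into myTable (Id, Email, CreateDate) values
    (1, 'a@example.com', '2015-01-01'),
    (2, 'a@example.com', '2015-02-01'),
    (3, 'a@example.com', '2015-03-01'),  -- latest row for a@example.com
    (4, 'b@example.com', '2015-01-15'),
    (5, 'b@example.com', '2015-03-15'),  -- latest row for b@example.com
    (6, 'c@example.com', '2015-02-20');  -- no duplicates
```

With these rows, the query above should return Ids 3, 5 and 6. Note that c@example.com still shows up at this stage even though it has no duplicates; it only drops out once the result is joined against the duplicate rows.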

OK, now let's write a query to get the email addresses that have duplicates:

select Email
from myTable
group by Email
having count(*) > 1

Join this query back to the table to get the key of every row that belongs to a duplicated email:

select x.Email, x.Id, x.CreateDate
from myTable x
join (
    select Email
    from myTable
    group by Email
    having count(*) > 1
) y on x.Email = y.Email

Great. Now all that's left is to combine the first query with this one to get our result:

select keep.Email, keep.Id as IdKeep, keep.CreateDate as CreateDateOfIdKeep,
    dup.Id as DuplicateId, dup.CreateDate as CreateDateOfDuplicateId
from (
    select x.Email, x.CreateDate, x.Id
    from myTable x
    join (
        select Email, max(CreateDate) as CreateDate
        from myTable
        group by Email
    ) y on x.Email = y.Email and x.CreateDate = y.CreateDate
) keep
join (
    select x.Email, x.Id, x.CreateDate
    from myTable x
    join (
        select Email
        from myTable
        group by Email
        having count(*) > 1
    ) y on x.Email = y.Email
) dup on keep.Email = dup.Email and keep.Id <> dup.Id

Note that the final keep.Id <> dup.Id predicate in the join ensures that the same row is never returned as both keep and dup.
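As a side note, because the join already requires keep.Id <> dup.Id, an email with only one row can never produce an output row, so the having count(*) > 1 subquery looks optional. A shorter sketch of the same idea follows. It assumes, as the question states, that CreateDate is unique per email; the alias latest and the order by are my additions.

```sql
select keep.Email,
       keep.Id         as IdKeep,
       keep.CreateDate as CreateDateOfIdKeep,
       dup.Id          as DuplicateId,
       dup.CreateDate  as CreateDateOfDuplicateId
from myTable keep
join (
    -- latest CreateDate per email
    select Email, max(CreateDate) as CreateDate
    from myTable
    group by Email
) latest on keep.Email = latest.Email and keep.CreateDate = latest.CreateDate
-- every other row with the same email is a duplicate of the kept row
join myTable dup
    on dup.Email = keep.Email
   and dup.Id <> keep.Id
order by keep.Email, dup.CreateDate;
```

If two rows for the same email could ever share the same latest CreateDate, both would be treated as the row to keep and would be reported as duplicates of each other, so a tie-breaker would be needed in that case.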

Answer 1: (score: 0)

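The following subquery uses a trick (with the MySQL functions group_concat and substring_index) to get the most recent ID and create date for each email: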

select Email, max(CreateDate) as CreateDate,
       substring_index(group_concat(id order by CreateDate desc), ',', 1) as id
from myTable
group by Email
having count(*) > 1;

The having clause also ensures that this only applies to emails that have duplicates.

Then this query just needs to be joined with the rest of the data to get the desired format:

select t.Email, tkeep.id as keep_id, tkeep.CreateDate as keep_date,
       t.id as dup_id, t.CreateDate as dup_CreateDate
from myTable t join
     (select Email, max(CreateDate) as CreateDate,
             substring_index(group_concat(id order by CreateDate desc), ',', 1) as id
      from myTable
      group by Email
      having count(*) > 1
     ) tkeep
     on t.Email = tkeep.Email and t.CreateDate <> tkeep.CreateDate;
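The group_concat / substring_index trick is specific to MySQL. On databases that support window functions (for example MySQL 8.0+, PostgreSQL, SQL Server, or SQLite 3.25+), a similar result can be sketched with row_number() instead. This is a different technique from the one in this answer, shown only as an alternative; the names ranked, rn, k and d are made up for the example.

```sql
with ranked as (
    -- rn = 1 marks the most recent row for each email
    select Email, id, CreateDate,
           row_number() over (partition by Email order by CreateDate desc) as rn
    from myTable
)
select k.Email,
       k.id         as keep_id,
       k.CreateDate as keep_date,
       d.id         as dup_id,
       d.CreateDate as dup_CreateDate
from ranked k
join ranked d
    on d.Email = k.Email
   and d.rn > 1        -- every older row is a duplicate
where k.rn = 1;        -- the newest row is the one to keep
```

Emails that occur only once have no row with rn > 1, so they drop out of the join without needing a having count(*) > 1 filter.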