SQL: Finding duplicate counts, newly added values, and removed values (dynamically) within the same table

Time: 2018-08-16 02:27:34

Tags: mysql sql sql-server

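I would like to accomplish the following with SQL:

1) Find the number of duplicate records
For each value of the "snapshot date" column, count the IDs that also appear on the previous date
2) Find the number of records added
3) Find the number of records removed

Please see the example tables below:

Current table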

snapshot_date | unique ID
 2018-08-15        1
 2018-08-15        2
 2018-08-15        3
 2018-08-15        4
 2018-08-15        5

 2018-08-16        1
 2018-08-16        3
 2018-08-16        4
 2018-08-16        6
 2018-08-16        7
 2018-08-16        8
 2018-08-16        9

 2018-08-17        3
 2018-08-17        8
 2018-08-17        10
 2018-08-17        11
 2018-08-17        12
 2018-08-17        13

Desired table

snapshot date | count | # of dupe from previous date | sum of ID added | sum of ID removed
 2018-08-15       5                 N/A                     N/A                  N/A 
 2018-08-16       7                  3                       4                    2
 2018-08-17       6                  2                       4                    5

If anyone knows a script that produces the desired table, I would really appreciate it! Thank you in advance!
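For reference, the numbers in the desired table can be reproduced with plain set arithmetic. The sketch below is only an illustration of the intended definitions, not a SQL solution; the names `sample_rows` and `daily_changes` are made up for this example, and it compares each date with the previous date present in the data (which, for this sample, is always the previous calendar day).

```python
from collections import defaultdict

# Sample rows from the "Current table": (snapshot_date, unique ID).
sample_rows = [
    ("2018-08-15", 1), ("2018-08-15", 2), ("2018-08-15", 3),
    ("2018-08-15", 4), ("2018-08-15", 5),
    ("2018-08-16", 1), ("2018-08-16", 3), ("2018-08-16", 4),
    ("2018-08-16", 6), ("2018-08-16", 7), ("2018-08-16", 8),
    ("2018-08-16", 9),
    ("2018-08-17", 3), ("2018-08-17", 8), ("2018-08-17", 10),
    ("2018-08-17", 11), ("2018-08-17", 12), ("2018-08-17", 13),
]


def daily_changes(rows):
    """Map each date to (count, dupes, added, removed) versus the prior date."""
    ids_by_date = defaultdict(set)
    for snapshot_date, uid in rows:
        ids_by_date[snapshot_date].add(uid)

    result = {}
    previous = None
    # ISO-formatted date strings sort chronologically.
    for snapshot_date in sorted(ids_by_date):
        current = ids_by_date[snapshot_date]
        if previous is None:
            # No earlier snapshot to compare with (the "N/A" row).
            result[snapshot_date] = (len(current), None, None, None)
        else:
            result[snapshot_date] = (
                len(current),             # count
                len(current & previous),  # duplicates from previous date
                len(current - previous),  # IDs added
                len(previous - current),  # IDs removed
            )
        previous = current
    return result


for day, stats in daily_changes(sample_rows).items():
    print(day, stats)
```

With these definitions, "duplicates" is the intersection of two consecutive snapshots, "added" is current minus previous, and "removed" is previous minus current.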

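2 Answers:

Answer 0: (score: 3)

If you are using MySQL (which, at least in earlier versions, does not support analytic functions such as LEAD and LAG), then one approach is to do a series of self joins and then aggregate to get the results you want: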

SELECT
    t1.snapshot_date,
    t1.count,
    t1.previous_dupe,
    t1.num_added,
    t2.num_subtracted
FROM
(
    -- Join each row to the same ID on the previous day:
    -- matched rows are duplicates, unmatched rows are newly added IDs.
    SELECT
        t1.snapshot_date,
        COUNT(*) AS count,
        COUNT(t2.snapshot_date) AS previous_dupe,
        COUNT(CASE WHEN t2.snapshot_date IS NULL THEN 1 END) AS num_added
    FROM yourTable t1
    LEFT JOIN yourTable t2
        ON t1.snapshot_date = DATE_ADD(t2.snapshot_date, INTERVAL 1 DAY) AND
           t1.uniqueID = t2.uniqueID
    GROUP BY t1.snapshot_date
) t1
LEFT JOIN
(
    -- Join each row to the same ID on the next day: unmatched rows are
    -- removed IDs, reported against the next day's snapshot_date.
    SELECT
        DATE_ADD(t1.snapshot_date, INTERVAL 1 DAY) AS snapshot_date,
        COUNT(CASE WHEN t2.snapshot_date IS NULL THEN 1 END) AS num_subtracted
    FROM yourTable t1
    LEFT JOIN yourTable t2
        ON t1.snapshot_date = DATE_SUB(t2.snapshot_date, INTERVAL 1 DAY) AND
           t1.uniqueID = t2.uniqueID
    GROUP BY t1.snapshot_date
) t2
    ON t1.snapshot_date = t2.snapshot_date;

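Note: There are slight discrepancies between my results and your expected output, partly due to a math error on your side and partly due to how the logic in the query works. I report 5 new IDs added for the earliest date, because conceptually there is no earlier record and all 5 values are technically new.

This problem is particularly ugly because we need to self join twice, in opposite directions, in two separate subqueries.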
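To try the two-direction self join on the sample data without a MySQL server, here is a minimal sketch that ports the query to SQLite through Python's `sqlite3` module. Several things are assumptions of this sketch rather than part of the answer: SQLite's `date(x, '+1 day')` and `date(x, '-1 day')` stand in for `DATE_ADD` and `DATE_SUB`, the derived tables are renamed `cur` and `rem`, the `count` alias becomes `cnt`, and `snapshot_date` is stored as ISO-formatted text.

```python
import sqlite3

# Sample data from the question: (snapshot_date, uniqueID).
SAMPLE_ROWS = [
    ("2018-08-15", 1), ("2018-08-15", 2), ("2018-08-15", 3),
    ("2018-08-15", 4), ("2018-08-15", 5),
    ("2018-08-16", 1), ("2018-08-16", 3), ("2018-08-16", 4),
    ("2018-08-16", 6), ("2018-08-16", 7), ("2018-08-16", 8),
    ("2018-08-16", 9),
    ("2018-08-17", 3), ("2018-08-17", 8), ("2018-08-17", 10),
    ("2018-08-17", 11), ("2018-08-17", 12), ("2018-08-17", 13),
]

# SQLite port of the self-join query (date() replaces DATE_ADD/DATE_SUB).
SELF_JOIN_QUERY = """
SELECT cur.snapshot_date, cur.cnt, cur.previous_dupe,
       cur.num_added, rem.num_subtracted
FROM
(
    -- look backwards: t2 is the same ID on the previous day
    SELECT t1.snapshot_date,
           COUNT(*) AS cnt,
           COUNT(t2.snapshot_date) AS previous_dupe,
           COUNT(CASE WHEN t2.snapshot_date IS NULL THEN 1 END) AS num_added
    FROM yourTable t1
    LEFT JOIN yourTable t2
        ON t1.snapshot_date = date(t2.snapshot_date, '+1 day')
       AND t1.uniqueID = t2.uniqueID
    GROUP BY t1.snapshot_date
) cur
LEFT JOIN
(
    -- look forwards: t2 is the same ID on the next day
    SELECT date(t1.snapshot_date, '+1 day') AS snapshot_date,
           COUNT(CASE WHEN t2.snapshot_date IS NULL THEN 1 END) AS num_subtracted
    FROM yourTable t1
    LEFT JOIN yourTable t2
        ON t1.snapshot_date = date(t2.snapshot_date, '-1 day')
       AND t1.uniqueID = t2.uniqueID
    GROUP BY t1.snapshot_date
) rem
    ON cur.snapshot_date = rem.snapshot_date
ORDER BY cur.snapshot_date
"""


def run_self_join_query(rows=SAMPLE_ROWS):
    """Load the rows into an in-memory SQLite table and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE yourTable (snapshot_date TEXT, uniqueID INTEGER)"
        )
        conn.executemany("INSERT INTO yourTable VALUES (?, ?)", rows)
        return conn.execute(SELF_JOIN_QUERY).fetchall()
    finally:
        conn.close()


for row in run_self_join_query():
    print(row)
```

On the sample data as shown in the question, the rows for 2018-08-16 and 2018-08-17 come out as (7, 3, 4, 2) and (6, 2, 4, 5). The first date comes out as (5, 0, 5, NULL) rather than N/A, because every ID on the earliest date is counted as added and no earlier date exists to supply a removed count.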

Answer 1: (score: 3)

Here is my take on it, based on SQL Server.

SELECT  snapshot_date       = COALESCE(c.snapshot_date, DATEADD(day, 1, p.snapshot_date)),
        [count]             = COUNT(c.snapshot_date),
        dup_from_prev_day   = SUM(CASE WHEN c.snapshot_date is not null 
                                       AND  p.snapshot_date is not null 
                                       THEN 1 END),
        sum_of_id_added     = SUM(CASE WHEN c.snapshot_date is not null 
                                       AND  p.snapshot_date is null 
                                       THEN 1 END),
        sum_of_id_removed   = SUM(CASE WHEN c.snapshot_date is null 
                                       AND  p.snapshot_date is not null 
                                       THEN 1 END)
FROM    yourTable c         -- current
        FULL OUTER JOIN yourTable p -- previous
        ON  c.snapshot_date     = DATEADD(DAY, 1, p.snapshot_date)
        AND c.uniqueID          = p.uniqueID
GROUP BY COALESCE(c.snapshot_date, DATEADD(DAY, 1, p.snapshot_date))
HAVING COUNT(c.snapshot_date) > 0

/* RESULT : 
snapshot_date  count  dup_from_prev_day  sum_of_id_added  sum_of_id_removed
2018-08-15     5      NULL               5                NULL
2018-08-16     7      3                  4                2
2018-08-17     6      2                  4                5
*/
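MySQL has no FULL OUTER JOIN, so the query above cannot be run there as written. One common workaround, sketched here as an assumption rather than as part of the answer, is to emulate the full join with a LEFT JOIN plus a UNION ALL of the unmatched rows from the other side. The sketch runs on SQLite through Python's `sqlite3` so it can be tried locally; the CTE name `paired`, the column names `cur_date` and `prev_date`, and the `cnt` alias are made up for this example, and SQLite's `date(x, '+1 day')` replaces `DATEADD`.

```python
import sqlite3

# Sample data from the question: (snapshot_date, uniqueID).
DATA = [
    ("2018-08-15", 1), ("2018-08-15", 2), ("2018-08-15", 3),
    ("2018-08-15", 4), ("2018-08-15", 5),
    ("2018-08-16", 1), ("2018-08-16", 3), ("2018-08-16", 4),
    ("2018-08-16", 6), ("2018-08-16", 7), ("2018-08-16", 8),
    ("2018-08-16", 9),
    ("2018-08-17", 3), ("2018-08-17", 8), ("2018-08-17", 10),
    ("2018-08-17", 11), ("2018-08-17", 12), ("2018-08-17", 13),
]

EMULATED_FULL_JOIN_QUERY = """
WITH paired AS (
    -- every current row, with its previous-day match if there is one
    SELECT c.snapshot_date AS cur_date, p.snapshot_date AS prev_date
    FROM yourTable c
    LEFT JOIN yourTable p
        ON c.snapshot_date = date(p.snapshot_date, '+1 day')
       AND c.uniqueID = p.uniqueID
    UNION ALL
    -- previous-day rows that have no match on the following day
    SELECT NULL, p.snapshot_date
    FROM yourTable p
    LEFT JOIN yourTable c
        ON c.snapshot_date = date(p.snapshot_date, '+1 day')
       AND c.uniqueID = p.uniqueID
    WHERE c.snapshot_date IS NULL
)
SELECT COALESCE(cur_date, date(prev_date, '+1 day')) AS snapshot_date,
       COUNT(cur_date) AS cnt,
       SUM(CASE WHEN cur_date IS NOT NULL AND prev_date IS NOT NULL
                THEN 1 END) AS dup_from_prev_day,
       SUM(CASE WHEN cur_date IS NOT NULL AND prev_date IS NULL
                THEN 1 END) AS sum_of_id_added,
       SUM(CASE WHEN cur_date IS NULL AND prev_date IS NOT NULL
                THEN 1 END) AS sum_of_id_removed
FROM paired
GROUP BY COALESCE(cur_date, date(prev_date, '+1 day'))
HAVING COUNT(cur_date) > 0
ORDER BY snapshot_date
"""


def run_emulated_full_join(rows=DATA):
    """Run the emulated full outer join on an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE yourTable (snapshot_date TEXT, uniqueID INTEGER)"
        )
        conn.executemany("INSERT INTO yourTable VALUES (?, ?)", rows)
        return conn.execute(EMULATED_FULL_JOIN_QUERY).fetchall()
    finally:
        conn.close()


for row in run_emulated_full_join():
    print(row)
```

On the sample data this reproduces the three result rows listed in the comment above, with NULL where SUM has nothing to add up. To move the idea to MySQL, `date(x, '+1 day')` would go back to `DATE_ADD(x, INTERVAL 1 DAY)`, and on versions before 8.0 the CTE would need to be written as a derived table.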