Question

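I need to find the missing numbers in a set of sequences grouped by year and department. For example, I have the following data in a table: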

╔══════╤══════╤═════╗
║ YEAR │ DEPT │ NUM ║
╠══════╪══════╪═════╣
║ 2016 │ 1    │ 1   ║
╟──────┼──────┼─────╢
║ 2016 │ 1    │ 2   ║
╟──────┼──────┼─────╢
║ 2016 │ 1    │ 4   ║
╟──────┼──────┼─────╢
║ 2016 │ 2    │ 10  ║
╟──────┼──────┼─────╢
║ 2016 │ 2    │ 12  ║
╟──────┼──────┼─────╢
║ 2016 │ 2    │ 13  ║
╟──────┼──────┼─────╢
║ 2015 │ 3    │ 6   ║
╟──────┼──────┼─────╢
║ 2015 │ 3    │ 8   ║
╟──────┼──────┼─────╢
║ 2015 │ 3    │ 9   ║
╟──────┼──────┼─────╢
║ 2015 │ 2    │ 24  ║
╟──────┼──────┼─────╢
║ 2015 │ 2    │ 26  ║
╟──────┼──────┼─────╢
║ 2015 │ 2    │ 27  ║
╚══════╧══════╧═════╝

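Normally, I would LEFT JOIN to a TALLY table, but I want to retain the YEAR and DEPT that each missing value belongs to. Something like the query below is what I would normally use, but I'm not sure how to carry along the year and department that the missing value corresponds to, especially since the MIN and MAX values may differ by YEAR and DEPT.

DECLARE @MIN INT = (SELECT MIN(NUM) FROM DOCUMENTS)
DECLARE @MAX INT = (SELECT MAX(NUM) FROM DOCUMENTS)

SELECT
    T.NUM AS 'MISSING'
FROM
    TALLY T
    LEFT JOIN DOCUMENTS D
        ON T.NUM = D.NUM  -- use the alias D; DOCUMENTS is not visible once aliased
WHERE
    D.NUM IS NULL
    AND T.NUM BETWEEN @MIN AND @MAX  -- filter on T.NUM; D.NUM is NULL for missing rows

My expected output is as follows:

╔══════╤══════╤═════════════╗
║ YEAR │ DEPT │ MISSING_NUM ║
╠══════╪══════╪═════════════╣
║ 2016 │ 1    │ 3           ║
╟──────┼──────┼─────────────╢
║ 2016 │ 2    │ 11          ║
╟──────┼──────┼─────────────╢
║ 2015 │ 3    │ 7           ║
╟──────┼──────┼─────────────╢
║ 2015 │ 2    │ 25          ║
╚══════╧══════╧═════════════╝

I think I might need to create a TALLY table with YEAR, DEPT, and NUM columns, but it would hold billions of rows, since my years range from 1800 to 2016 and there are 15 different departments, with NUM ranging from 1 to 100 million within those departments. So I don't think that is the most efficient or practical approach.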

Answer 1

If only one value can be missing from each gap, you can do:

select t.year, t.dept, t.num + 1
from t
where t.num < (select max(t2.num) from t t2 where t2.year = t.year and t2.dept = t.dept) and
      not exists (select 1
                  from t t2
                  where t2.year = t.year and t2.dept = t.dept and
                        t.num + 1 = t2.num
                 );
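As a rough check of the query above, here is a minimal sketch that runs it against the question's sample rows. It assumes an in-memory SQLite database driven from Python rather than SQL Server, and a table named `t` with columns `year`, `dept`, `num`; the `missing_num` alias, the `order by`, and the helper `find_single_missing` are added for illustration only.

```python
import sqlite3

# Sample rows from the question: (year, dept, num).
ROWS = [
    (2016, 1, 1), (2016, 1, 2), (2016, 1, 4),
    (2016, 2, 10), (2016, 2, 12), (2016, 2, 13),
    (2015, 3, 6), (2015, 3, 8), (2015, 3, 9),
    (2015, 2, 24), (2015, 2, 26), (2015, 2, 27),
]

# The answer's query, plus an alias and ORDER BY for a stable result.
SINGLE_MISSING_SQL = """
select t.year, t.dept, t.num + 1 as missing_num
from t
where t.num < (select max(t2.num) from t t2
               where t2.year = t.year and t2.dept = t.dept) and
      not exists (select 1
                  from t t2
                  where t2.year = t.year and t2.dept = t.dept and
                        t.num + 1 = t2.num
                 )
order by t.year, t.dept, missing_num
"""

def find_single_missing(rows):
    """Hypothetical helper: load rows into SQLite and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table t (year integer, dept integer, num integer)")
        conn.executemany("insert into t values (?, ?, ?)", rows)
        return conn.execute(SINGLE_MISSING_SQL).fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for row in find_single_missing(ROWS):
        print(row)
```

Note that when a gap is wider than one value, this query reports only the first missing number of that gap, which is why the answer restricts it to the single-missing-value case.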

In SQL Server 2012+, this can be simplified to:

select year, dept, num + 1 as num
from (select t.*, lead(num) over (partition by year, dept order by num) as next_num
      from t
     ) t
where next_num <> num + 1;  -- Note:  this handles the final num where `next_num` is `NULL`

This approach can actually be generalized to find ranges of missing values. Assuming you are using SQL Server 2012+:

select year, dept, num + 1 as start_missing, next_num - 1 as end_missing
from (select t.*, lead(num) over (partition by year, dept order by num) as next_num
      from t
     ) t
where next_num <> num + 1;  -- Note:  this handles the final num where `next_num` is `NULL`
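The range query can be tried out the same way. This is a sketch under assumptions: it uses SQLite through Python instead of SQL Server (window functions need SQLite 3.25 or newer), a table `t` with columns `year`, `dept`, `num`, and a hypothetical helper `find_gaps`; the `order by` is added only to make the output deterministic.

```python
import sqlite3

# Sample rows from the question: (year, dept, num).
ROWS = [
    (2016, 1, 1), (2016, 1, 2), (2016, 1, 4),
    (2016, 2, 10), (2016, 2, 12), (2016, 2, 13),
    (2015, 3, 6), (2015, 3, 8), (2015, 3, 9),
    (2015, 2, 24), (2015, 2, 26), (2015, 2, 27),
]

# lead() fetches the next num within each (year, dept) partition.
# For the last row of a partition next_num is NULL, so the comparison
# evaluates to NULL and the row is filtered out.
GAP_SQL = """
select year, dept, num + 1 as start_missing, next_num - 1 as end_missing
from (select t.*, lead(num) over (partition by year, dept order by num) as next_num
      from t
     ) t
where next_num <> num + 1
order by year, dept, start_missing
"""

def find_gaps(rows):
    """Hypothetical helper: return (year, dept, start, end) for each gap."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table t (year integer, dept integer, num integer)")
        conn.executemany("insert into t values (?, ?, ?)", rows)
        return conn.execute(GAP_SQL).fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for gap in find_gaps(ROWS):
        print(gap)
```

Because each gap is returned as a single start/end pair, the result stays small even when NUM spans a very wide range, which matters for the data sizes described in the question.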

Answer 2

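One approach is to use a recursive CTE that generates all the numbers between the minimum and maximum for each year and department combination. Then left join the generated numbers against the table to find the missing ones.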

with t1 as (select yr,dept,max(num) maxnum, min(num) minnum 
            from t 
            group by yr,dept)
,x as (select yr, dept, minnum, maxnum from t1
       union all
       select yr, dept, minnum+1, maxnum 
       from x 
       where minnum < maxnum
       )
select x.yr,x.dept,x.minnum as missing_num 
from x  
left join t on t.yr=x.yr and t.dept=x.dept and t.num = x.minnum
where t.num is null 
order by 1,2,3

Example with sample data
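For readers without the linked demo, the following sketch runs the recursive CTE on the question's sample rows. It is an illustration under assumptions: SQLite via Python instead of SQL Server, the `recursive` keyword added after `with` (SQL Server omits it), the answer's column name `yr`, and a hypothetical helper `find_missing`. On SQL Server itself, recursion stops after 100 levels by default, so wide ranges would likely need an `OPTION (MAXRECURSION n)` hint.

```python
import sqlite3

# Sample rows from the question: (yr, dept, num).
ROWS = [
    (2016, 1, 1), (2016, 1, 2), (2016, 1, 4),
    (2016, 2, 10), (2016, 2, 12), (2016, 2, 13),
    (2015, 3, 6), (2015, 3, 8), (2015, 3, 9),
    (2015, 2, 24), (2015, 2, 26), (2015, 2, 27),
]

# t1: min and max num per (yr, dept).
# x:  counts from minnum up to maxnum, one row per candidate number.
MISSING_SQL = """
with recursive t1 as (select yr, dept, max(num) as maxnum, min(num) as minnum
                      from t
                      group by yr, dept)
, x as (select yr, dept, minnum, maxnum from t1
        union all
        select yr, dept, minnum + 1, maxnum
        from x
        where minnum < maxnum
       )
select x.yr, x.dept, x.minnum as missing_num
from x
left join t on t.yr = x.yr and t.dept = x.dept and t.num = x.minnum
where t.num is null
order by 1, 2, 3
"""

def find_missing(rows):
    """Hypothetical helper: return (yr, dept, missing_num) rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table t (yr integer, dept integer, num integer)")
        conn.executemany("insert into t values (?, ?, ?)", rows)
        return conn.execute(MISSING_SQL).fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for row in find_missing(ROWS):
        print(row)
```

Unlike the range query, this returns one row per missing number, so a gap of several values produces several rows.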

Finding missing sequence values using GROUP BY criteria
