SQL Server: compress adjacent date ranges

Asked: 2017-05-04 14:44:33

Tags: sql-server compression range lag lead

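I have a table containing a person ID and a date range (a start date and an end date). Each person can have multiple rows, each with its own start and end dates.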

create table #DateRanges (
   tableID   int not null,
   personID  int not null,
   startDate date,
   endDate   date
);
insert #DateRanges (tableID, personID, startDate, endDate)
values (1, 100, '2011-01-01', '2011-01-31') -- Just January
     , (2, 100, '2011-02-01', '2011-02-28') -- Just February
     , (3, 100, '2011-04-01', '2011-04-30') -- April - Skipped March
     , (4, 100, '2011-05-01', '2011-05-31') -- May
     , (5, 100, '2011-06-01', '2011-12-31') -- June through December

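I need a way to collapse adjacent date ranges, where one row's end date is exactly one day before the next row's start date. The result must merge every run of contiguous ranges and split only where the gap between one range's end and the next range's start is greater than one day. The data above should compress to: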

+-----------+----------+--------------+------------+
| SomeNewID | PersonID | NewStartDate | NewEndDate |
+-----------+----------+--------------+------------+
|        1  |     100  |   2011-01-01 | 2011-02-28 |
+-----------+----------+--------------+------------+
|        2  |     100  |   2011-04-01 | 2011-12-31 |
+-----------+----------+--------------+------------+

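There are only two rows because the only missing range is March. If all of March were present, whether as one row or several, the compression would produce a single row. But if only two days in the middle of March were present, we would get a third row showing those March dates.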

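I have been trying to do this as a set-based operation with the LEAD and LAG functions in SQL Server 2016, but so far I have come up empty. I would like to do it without loops or RBAR (row-by-agonizing-row) processing, but I don't see a solution.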

2 answers:

Answer 0: (score: 0)

You can use LAG to assign each row to the right bucket and then group by that bucket:

;with cte1 as (
    -- Gap in days between the previous row's endDate and this row's startDate (null for a person's first row)
    select *, dtdiff = datediff(day, lag(endDate) over (partition by personID order by startDate), startDate)
    from #DateRanges
),
cte2 as (
    -- Running count of breaks: a new bucket starts at a person's first row or when the gap exceeds one day
    select *, grp = sum(case when dtdiff is null or dtdiff > 1 then 1 else 0 end) over (partition by personID order by startDate)
    from cte1
)
-- Collapse each bucket to its earliest start date and latest end date
select SomeNewId = row_number() over (order by personID, min(startDate)), personID, NewStartDate = min(startDate), NewEndDate = max(endDate)
from cte2
group by personID, grp;

Output for your data:

+-----------+----------+--------------+------------+
| SomeNewId | Personid | NewStartDate | NewEndDate |
+-----------+----------+--------------+------------+
|         1 |      100 | 2011-01-01   | 2011-02-28 |
|         2 |      100 | 2011-04-01   | 2011-12-31 |
+-----------+----------+--------------+------------+

My test input:

insert #DateRanges (tableID, personID, startDate, endDate)
values (1, 100, '2011-01-01', '2011-01-31') -- Just January
     , (2, 100, '2011-02-01', '2011-02-28') -- Just February
     , (3, 100, '2011-04-01', '2011-04-30') -- April - Skipped March
     , (4, 100, '2011-05-01', '2011-05-31') -- May
     , (5, 100, '2011-06-01', '2011-06-30') -- More gaps
     , (6, 100, '2011-07-01', '2011-07-31') -- More gaps
     , (7, 100, '2011-08-01', '2011-08-31') -- More gaps
     , (8, 100, '2011-10-01', '2011-10-31') -- More gaps
     , (9, 100, '2011-11-01', '2011-11-30') -- More gaps

Output for the test data:

+-----------+----------+--------------+------------+
| SomeNewId | Personid | NewStartDate | NewEndDate |
+-----------+----------+--------------+------------+
|         1 |      100 | 2011-01-01   | 2011-02-28 |
|         2 |      100 | 2011-04-01   | 2011-08-31 |
|         3 |      100 | 2011-10-01   | 2011-11-30 |
+-----------+----------+--------------+------------+
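
To see how the buckets form, it can help to look at the intermediate values row by row before grouping. The following is a minimal sketch, assuming the #DateRanges table from the question and ranges that do not overlap; the names gaps, gapDays and isNewGroup are introduced here only for illustration.

```sql
;with gaps as (
    select tableID, personID, startDate, endDate,
           -- days from the previous range's end to this range's start;
           -- null on each person's first row
           gapDays = datediff(day,
                              lag(endDate) over (partition by personID order by startDate),
                              startDate)
    from #DateRanges
)
select tableID, personID, startDate, endDate, gapDays,
       -- 1 when this row opens a new compressed range
       isNewGroup = case when gapDays is null or gapDays > 1 then 1 else 0 end,
       -- running total of the flag = bucket number within the person
       grp = sum(case when gapDays is null or gapDays > 1 then 1 else 0 end)
             over (partition by personID order by startDate
                   rows between unbounded preceding and current row)
from gaps
order by personID, startDate;
```

For the five rows in the question, gapDays comes out as null, 1, 32, 1, 1, so the running total assigns the January and February rows to bucket 1 and the remaining rows to bucket 2.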

Answer 1: (score: 0)

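After a few days of work, I think I have a solution, which I want to share in case anyone else needs something similar. I used several CTEs to find the lead, lag, and gap values, reduced the rows to only the significant start and end dates, and then used more LEAD and LAG calls to find the compressed start and end dates. There may be a simpler way, but I think this handles day-level resolution well.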
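
The steps described above might be sketched roughly as follows. This is not the answerer's code, only an illustration of the same idea under stated assumptions: the #DateRanges table from the question, no null dates, and no overlapping ranges. The CTE and column names (flagged, edges, paired, isStart, isEnd, blockEnd) are hypothetical.

```sql
;with flagged as (
    select personID, startDate, endDate,
           -- neighbouring rows for the same person, ordered by start date
           prevEnd   = lag(endDate)    over (partition by personID order by startDate),
           nextStart = lead(startDate) over (partition by personID order by startDate)
    from #DateRanges
),
edges as (
    select personID, startDate, endDate,
           -- a row opens a block if nothing ends the day before it starts
           isStart = case when prevEnd is null
                            or datediff(day, prevEnd, startDate) > 1 then 1 else 0 end,
           -- a row closes a block if nothing starts the day after it ends
           isEnd   = case when nextStart is null
                            or datediff(day, endDate, nextStart) > 1 then 1 else 0 end
    from flagged
),
paired as (
    -- keep only the significant rows; interior rows of a block are dropped,
    -- so the row after an opening row is the row that closes the same block
    select personID, startDate, isStart,
           blockEnd = case when isEnd = 1 then endDate
                           else lead(endDate) over (partition by personID order by startDate)
                      end
    from edges
    where isStart = 1 or isEnd = 1
)
select SomeNewID    = row_number() over (order by personID, startDate),
       personID,
       NewStartDate = startDate,
       NewEndDate   = blockEnd
from paired
where isStart = 1;
```

The LEAD in paired runs after the WHERE filter, so it only ever sees the rows that open or close a block. For the question's data this yields 2011-01-01 to 2011-02-28 and 2011-04-01 to 2011-12-31.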
