Determine consecutive date intervals

Time: 2015-05-18 12:33:35

Tags: sql sql-server sql-server-2008 tsql

I have the following table structure:

id int -- more like a group id, not unique in the table
AddedOn datetime -- when the record was added

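For a given id there is at most one record per day. I need to write a query that returns, for each id, the intervals of consecutive dates (at day granularity). The expected result structure is: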

id int
StartDate datetime
EndDate datetime

Note that AddedOn has a time component, but it is not relevant here.

To make this clearer, here is some input data:

with data as 
(
  select * from
  (
    values
    (0, getdate()), --dummy record used to infer column types

    (1, '20150101'),
    (1, '20150102'),
    (1, '20150104'),
    (1, '20150105'),
    (1, '20150106'),

    (2, '20150101'),
    (2, '20150102'),
    (2, '20150103'),
    (2, '20150104'),
    (2, '20150106'),
    (2, '20150107'),

    (3, '20150101'),
    (3, '20150103'),
    (3, '20150105'),
    (3, '20150106'),
    (3, '20150108'),
    (3, '20150109'),
    (3, '20150110')
  ) as d(id, AddedOn)
  where id > 0 -- exclude dummy record
)
select * from data

Expected result:

id      StartDate      EndDate
1       2015-01-01     2015-01-02
1       2015-01-04     2015-01-06

2       2015-01-01     2015-01-04
2       2015-01-06     2015-01-07

3       2015-01-01     2015-01-01
3       2015-01-03     2015-01-03
3       2015-01-05     2015-01-06
3       2015-01-08     2015-01-10

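This looks like a common problem, but I could not find a similar question. I am also getting close to a solution myself and will post it when (if) it works, but I feel there should be a more elegant one.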

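5 answers:

Answer 0 (score: 5)

Here is an answer without any fancy joins, using just GROUP BY and ROW_NUMBER. It is simple and, because it avoids self-joins, typically more efficient.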

WITH CTE_dayOfYear
AS
(
    SELECT  id,
            AddedOn,
            DATEDIFF(DAY,'20000101',AddedOn) dyID,
            ROW_NUMBER() OVER (ORDER BY ID,AddedOn) row_num
    FROM data
)

SELECT  ID,
        MIN(AddedOn) StartDate,
        MAX(AddedOn) EndDate,
        dyID-row_num AS groupID
FROM CTE_dayOfYear
GROUP BY ID,dyID - row_num
ORDER BY ID,2,3

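The logic is that dyID is derived from the date, so it has gaps, while row_num has none. Within a run of consecutive days the difference between dyID and row_num stays constant, and every time dyID skips a day that difference changes. I therefore use the difference as my groupID. Because row_num runs across all ids, the difference is only meaningful together with ID, which is why the GROUP BY includes both.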
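The grouping key described above can be modelled outside the database. The following is a minimal Python sketch of the same idea, not the T-SQL itself; the function name `islands` and the in-memory `rows` list are assumptions made for illustration, with the data taken from the question for ids 1 and 2.

```python
from datetime import date
from itertools import groupby

# Sample rows (id, AddedOn) taken from the question's test data for ids 1 and 2.
rows = [
    (1, date(2015, 1, 1)), (1, date(2015, 1, 2)), (1, date(2015, 1, 4)),
    (1, date(2015, 1, 5)), (1, date(2015, 1, 6)),
    (2, date(2015, 1, 1)), (2, date(2015, 1, 2)), (2, date(2015, 1, 3)),
    (2, date(2015, 1, 4)), (2, date(2015, 1, 6)), (2, date(2015, 1, 7)),
]

def islands(rows):
    """Group consecutive days per id using the (day number - row number) key."""
    epoch = date(2000, 1, 1)   # same anchor date as the DATEDIFF call
    ordered = sorted(rows)     # models ORDER BY id, AddedOn
    keyed = []
    for row_num, (id_, added_on) in enumerate(ordered, start=1):
        dy_id = (added_on - epoch).days   # models DATEDIFF(DAY, '20000101', AddedOn)
        keyed.append((id_, dy_id - row_num, added_on))
    result = []
    # models GROUP BY id, dyID - row_num; equal keys are adjacent after sorting
    for (id_, _), grp in groupby(keyed, key=lambda r: (r[0], r[1])):
        dates = [r[2] for r in grp]
        result.append((id_, min(dates), max(dates)))
    return result
```

Calling `islands(rows)` yields one (id, StartDate, EndDate) tuple per run of consecutive days, in id and date order.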

Answer 1 (score: 3)

In SQL Server 2008, which lacks the LEAD and LAG functions (they were added in SQL Server 2012), this is a bit painful:

WITH    data
          AS ( SELECT   * ,
                        ROW_NUMBER() OVER ( ORDER BY id, AddedOn ) AS rn
               FROM     ( VALUES ( 0, GETDATE()), --dummy record used to infer column types
                        ( 1, '20150101'), ( 1, '20150102'), ( 1, '20150104'),
                        ( 1, '20150105'), ( 1, '20150106'), ( 2, '20150101'),
                        ( 2, '20150102'), ( 2, '20150103'), ( 2, '20150104'),
                        ( 2, '20150106'), ( 2, '20150107'), ( 3, '20150101'),
                        ( 3, '20150103'), ( 3, '20150105'), ( 3, '20150106'),
                        ( 3, '20150108'), ( 3, '20150109'), ( 3, '20150110') )
                        AS d ( id, AddedOn )
               WHERE    id > 0 -- exclude dummy record
             ),
        diff
          AS ( SELECT   d1.* ,
                        CASE WHEN ISNULL(DATEDIFF(dd, d2.AddedOn, d1.AddedOn),
                                         1) = 1 THEN 0
                             ELSE 1
                        END AS diff
               FROM     data d1
                        LEFT JOIN data d2 ON d1.id = d2.id
                                             AND d1.rn = d2.rn + 1
             ),
        parts
          AS ( SELECT   * ,
                        ( SELECT    SUM(diff)
                          FROM      diff d2
                          WHERE     d2.rn <= d1.rn
                        ) AS p
               FROM     diff d1
             )
    SELECT  id ,
            MIN(AddedOn) AS StartDate ,
            MAX(AddedOn) AS EndDate
    FROM    parts
    GROUP BY id ,
            p

Output:

id  StartDate               EndDate
1   2015-01-01 00:00:00.000 2015-01-02 00:00:00.000
1   2015-01-04 00:00:00.000 2015-01-06 00:00:00.000
2   2015-01-01 00:00:00.000 2015-01-04 00:00:00.000
2   2015-01-06 00:00:00.000 2015-01-07 00:00:00.000
3   2015-01-01 00:00:00.000 2015-01-01 00:00:00.000
3   2015-01-03 00:00:00.000 2015-01-03 00:00:00.000
3   2015-01-05 00:00:00.000 2015-01-06 00:00:00.000
3   2015-01-08 00:00:00.000 2015-01-10 00:00:00.000

Worked example:

The diff CTE returns this data:

id  AddedOn                 rn  diff
1   2015-01-01 00:00:00.000 1   0
1   2015-01-02 00:00:00.000 2   0
1   2015-01-04 00:00:00.000 3   1
1   2015-01-05 00:00:00.000 4   0
1   2015-01-06 00:00:00.000 5   0

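You join the table to itself to get the previous row. Then you compute the difference in days between the current row and the previous row: if it is 1 day (or there is no previous row for that id), you pick 0, otherwise you pick 1.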
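As a rough illustration of that flagging step, here is a small Python sketch. It models the self-join by looking at the previous element of a sorted list; the function name `gap_flags` and the tuple layout are assumptions for this example, not part of the answer's SQL.

```python
from datetime import date

def gap_flags(rows):
    """Return (id, AddedOn, rn, diff) tuples; diff is 1 where a gap precedes the row."""
    ordered = sorted(rows)   # models ROW_NUMBER() OVER (ORDER BY id, AddedOn)
    out = []
    for rn, (id_, added_on) in enumerate(ordered, start=1):
        prev = ordered[rn - 2] if rn > 1 else None
        if prev is None or prev[0] != id_:
            days = 1         # models ISNULL(..., 1): no previous row for this id
        else:
            days = (added_on - prev[1]).days
        out.append((id_, added_on, rn, 0 if days == 1 else 1))
    return out

sample = [
    (1, date(2015, 1, 1)), (1, date(2015, 1, 2)), (1, date(2015, 1, 4)),
    (1, date(2015, 1, 5)), (1, date(2015, 1, 6)), (2, date(2015, 1, 1)),
]
flags = [diff for _, _, _, diff in gap_flags(sample)]
```

For this sample only the row for 2015-01-04 is flagged, since it is the only one that follows a gap within its id.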

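The parts CTE takes the result of the previous step and sums up the new column (it is a cumulative sum: the sum of all values of the new column from the start up to the current row), so you get partitions to group by:

id  AddedOn                 rn  diff p
1   2015-01-01 00:00:00.000 1   0   0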
1   2015-01-02 00:00:00.000 2   0   0
1   2015-01-04 00:00:00.000 3   1   1
1   2015-01-05 00:00:00.000 4   0   1
1   2015-01-06 00:00:00.000 5   0   1
2   2015-01-01 00:00:00.000 6   0   1
2   2015-01-02 00:00:00.000 7   0   1
2   2015-01-03 00:00:00.000 8   0   1
2   2015-01-04 00:00:00.000 9   0   1
2   2015-01-06 00:00:00.000 10  1   2
2   2015-01-07 00:00:00.000 11  0   2
3   2015-01-01 00:00:00.000 12  0   2
3   2015-01-03 00:00:00.000 13  1   3

The last step is to group by id and the new column, and to pick the min and max values of the dates.
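The cumulative sum and the final grouping can be sketched in Python as follows. This is an illustrative model under the assumption that the rows already carry the 0/1 flag from the previous step; `group_islands` and `flagged` are hypothetical names. Note that p = 1 is shared by the end of id 1 and the start of id 2, which is why the grouping key must include id.

```python
from datetime import date
from itertools import accumulate

# (id, AddedOn, diff) rows as listed in the diff step for ids 1 and 2.
flagged = [
    (1, date(2015, 1, 1), 0), (1, date(2015, 1, 2), 0), (1, date(2015, 1, 4), 1),
    (1, date(2015, 1, 5), 0), (1, date(2015, 1, 6), 0),
    (2, date(2015, 1, 1), 0), (2, date(2015, 1, 2), 0), (2, date(2015, 1, 3), 0),
    (2, date(2015, 1, 4), 0), (2, date(2015, 1, 6), 1), (2, date(2015, 1, 7), 0),
]

def group_islands(flagged):
    """Running-sum the flags into p, then take min/max AddedOn per (id, p)."""
    running = list(accumulate(diff for _, _, diff in flagged))   # cumulative sum p
    spans = {}
    for (id_, added_on, _), p in zip(flagged, running):
        lo, hi = spans.get((id_, p), (added_on, added_on))
        spans[(id_, p)] = (min(lo, added_on), max(hi, added_on))
    # models GROUP BY id, p with MIN(AddedOn) and MAX(AddedOn)
    return [(id_, lo, hi) for (id_, _), (lo, hi) in sorted(spans.items())]
```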

Answer 2 (score: 2)

I picked "Islands Solution #3 from SQL MVP Deep Dives" from https://www.simple-talk.com/sql/t-sql-programming/the-sql-of-gaps-and-islands-in-sequences/ and applied it to your test data:

with 
data as 
(
    select * from
    (
    values
    (0, getdate()), --dummy record used to infer column types

    (1, '20150101'),
    (1, '20150102'),
    (1, '20150104'),
    (1, '20150105'),
    (1, '20150106'),

    (2, '20150101'),
    (2, '20150102'),
    (2, '20150103'),
    (2, '20150104'),
    (2, '20150106'),
    (2, '20150107'),

    (3, '20150101'),
    (3, '20150103'),
    (3, '20150105'),
    (3, '20150106'),
    (3, '20150108'),
    (3, '20150109'),
    (3, '20150110')
    ) as d(id, AddedOn)
    where id > 0 -- exclude dummy record
)
,CTE_Seq
AS
(
    SELECT
        ID
        ,SeqNo
        ,SeqNo - ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo) AS rn
    FROM
        data
        CROSS APPLY
        (
            SELECT DATEDIFF(day, '20150101', AddedOn) AS SeqNo
        ) AS CA
)
SELECT
    ID
    ,DATEADD(day, MIN(SeqNo), '20150101') AS StartDate
    ,DATEADD(day, MAX(SeqNo), '20150101') AS EndDate
FROM CTE_Seq
GROUP BY ID, rn
ORDER BY ID, StartDate;

Result set:

ID  StartDate               EndDate
1   2015-01-01 00:00:00.000 2015-01-02 00:00:00.000
1   2015-01-04 00:00:00.000 2015-01-06 00:00:00.000
2   2015-01-01 00:00:00.000 2015-01-04 00:00:00.000
2   2015-01-06 00:00:00.000 2015-01-07 00:00:00.000
3   2015-01-01 00:00:00.000 2015-01-01 00:00:00.000
3   2015-01-03 00:00:00.000 2015-01-03 00:00:00.000
3   2015-01-05 00:00:00.000 2015-01-06 00:00:00.000
3   2015-01-08 00:00:00.000 2015-01-10 00:00:00.000

I recommend examining the intermediate result of CTE_Seq to understand how it actually works. Just put

select * from CTE_Seq

in place of the final SELECT ... GROUP BY .... You will get this result set:

ID  SeqNo   rn
1   0   -1
1   1   -1
1   3   0
1   4   0
1   5   0
2   0   -1
2   1   -1
2   2   -1
2   3   -1
2   5   0
2   6   0
3   0   -1
3   2   0
3   4   1
3   5   1
3   7   2
3   8   2
3   9   2

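Each date is converted to a sequence number by DATEDIFF(day, '20150101', AddedOn). ROW_NUMBER() generates a sequence of consecutive numbers without gaps, so when these numbers are subtracted from a sequence that has gaps, the difference jumps at each gap. The difference stays the same until the next gap, so in the final SELECT the GROUP BY ID, rn brings together all rows that belong to the same island.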
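To see those intermediate values without running the query, here is a small Python sketch for a single ID. It is a model of the CTE_Seq columns, with numbering restarted per ID as PARTITION BY ID does; the names `seq_and_rn`, `islands_for_id` and `id3` are assumptions for illustration.

```python
from datetime import date, timedelta

BASE = date(2015, 1, 1)   # same anchor date as in the DATEDIFF/DATEADD calls

def seq_and_rn(dates):
    """Return (SeqNo, rn) pairs for one ID, mirroring the CTE_Seq columns."""
    seq_nos = sorted((d - BASE).days for d in dates)   # models DATEDIFF(day, BASE, AddedOn)
    return [(seq, seq - row) for row, seq in enumerate(seq_nos, start=1)]

def islands_for_id(dates):
    """Collapse the (SeqNo, rn) pairs into one (StartDate, EndDate) per rn value."""
    bounds = {}
    for seq, rn in seq_and_rn(dates):
        lo, hi = bounds.get(rn, (seq, seq))
        bounds[rn] = (min(lo, seq), max(hi, seq))
    # models DATEADD(day, MIN(SeqNo), BASE) and DATEADD(day, MAX(SeqNo), BASE)
    return [(BASE + timedelta(days=lo), BASE + timedelta(days=hi))
            for lo, hi in sorted(bounds.values())]

# AddedOn values for ID 3 from the question's test data.
id3 = [date(2015, 1, d) for d in (1, 3, 5, 6, 8, 9, 10)]
```

For `id3`, `seq_and_rn` reproduces the SeqNo and rn columns listed above for ID 3, and `islands_for_id` returns the four islands from the expected result.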

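Answer 3 (score: 2)

Here is a simple solution that does not use analytic functions. I tend to avoid analytics because I work with many different DBMSs, many of which have not (yet) implemented them, and even those that have use differing syntax. I am just in the habit of writing generic code wherever possible.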

with
Data( ID, AddedOn )as(
  select 1, convert( date, '20150101' ) union all
  select 1, '20150102' union all
  select 1, '20150104' union all
  select 1, '20150105' union all
  select 1, '20150106' union all
  select 2, '20150101' union all
  select 2, '20150102' union all
  select 2, '20150103' union all
  select 2, '20150104' union all
  select 2, '20150106' union all
  select 2, '20150107' union all
  select 3, '20150101' union all
  select 3, '20150103' union all
  select 3, '20150105' union all
  select 3, '20150106' union all
  select 3, '20150108' union all
  select 3, '20150109' union all
  select 3, '20150110'
)
select  d.ID, d.AddedOn StartDate, IsNull( d1.AddedOn, '99991231' ) EndDate
from    Data    d
left join Data  d1
    on  d1.ID   = d.ID
    and d1.AddedOn  =(
        select  Min( AddedOn )
        from    data
        where   ID  = d.ID
        and AddedOn > d.AddedOn );

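In your case I assume that ID and AddedOn form a composite PK and are therefore indexed. If so, the query should run impressively fast even on very large tables.

Also, I used an outer join because it seemed that the last AddedOn date of each ID should appear in the StartDate column. I used a common MaxDate value instead of NULL. NULL would work just as well as a "this is the latest StartDate row" flag.

Here is the output for ID = 1: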

ID          StartDate  EndDate
----------- ---------- ----------
1           2015-01-01 2015-01-02
1           2015-01-02 2015-01-04
1           2015-01-04 2015-01-05
1           2015-01-05 2015-01-06
1           2015-01-06 9999-12-31
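What the query above computes is, for each row, the next later AddedOn of the same ID, so every date appears as a StartDate rather than runs being collapsed. A minimal Python sketch of that pairing, with `pair_with_next` and `MAX_DATE` as names assumed for illustration:

```python
from datetime import date

MAX_DATE = date(9999, 12, 31)   # stands in for the '99991231' fallback value

def pair_with_next(dates):
    """Pair each date with the smallest later date, or MAX_DATE if there is none."""
    ordered = sorted(dates)
    return [
        # models the correlated subquery: MIN(AddedOn) WHERE AddedOn > d.AddedOn
        (d, min((x for x in ordered if x > d), default=MAX_DATE))
        for d in ordered
    ]

# AddedOn values for ID 1 from the question's test data.
id1 = [date(2015, 1, d) for d in (1, 2, 4, 5, 6)]
pairs = pair_with_next(id1)
```

For `id1` the pairs match the five output rows listed for ID = 1, including the 9999-12-31 end date on the last row.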

Answer 4 (score: 1)

I would also like to post my own solution, since it takes another approach:

{{1}}