Removing duplicate dates

Time: 2012-10-26 20:32:23

Tags: tsql date duplicates duplicate-removal

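I am writing a script to extract data from a database with millions of rows, and I have a problem with gaps between periods. We have decided that a gap of less than 10 days should not be counted as a gap, so such gaps should be removed (see the example below; the dates in bold make up the "real" periods of interest).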

ID  InDate      OutDate
1   2008-10-10  2009-02-05
1   2009-02-08  2009-05-13
1   2011-01-01  2011-05-20
2   2007-03-17  2008-10-19
2   2009-05-30  2010-10-12
2   2010-10-14  2010-12-31
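
To reproduce the example, a setup along these lines might work. The table name #t4 and the column names come from the queries in this thread; the column types and the DROP guard are assumptions, since the original schema is not shown.

```sql
-- Hypothetical setup: column types are assumed, not taken from the real schema.
IF OBJECT_ID('tempdb..#t4') IS NOT NULL
    DROP TABLE #t4;

CREATE TABLE #t4 (
    Id      INT  NOT NULL,
    InDate  DATE NOT NULL,
    OutDate DATE NOT NULL
);

-- The six sample rows from the example.
INSERT INTO #t4 (Id, InDate, OutDate) VALUES
    (1, '2008-10-10', '2009-02-05'),
    (1, '2009-02-08', '2009-05-13'),
    (1, '2011-01-01', '2011-05-20'),
    (2, '2007-03-17', '2008-10-19'),
    (2, '2009-05-30', '2010-10-12'),
    (2, '2010-10-14', '2010-12-31');
```

The DATE type and the multi-row VALUES list both require SQL Server 2008 or later.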

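This raises several problems. The first is to determine which OutDates and InDates are close enough to each other to be merged into a single period. The next is to move the OutDate from the higher row number to the lower row number (in the table). The last is to identify and remove the rows that are now duplicates.

I have tried to solve this below. Table #t4a solves the first two problems. The strategy in table #t4aa is to eliminate the duplicates by flagging the duplicate rows in a new (dummy) variable and removing all rows with that value (the 1s) at a later stage. However, it does not work! All rows are flagged 0, even those that should be flagged 1. Any suggestions?

-- This temp table measures the gaps and creates a new variable, OutDate2, which in the case of a small gap (less than 11 days) holds the next row's OutDate instead of the original value.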

WITH C AS (SELECT Id, InDate, OutDate, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY InDate) Rownum FROM #t4 t4)  
SELECT cur.Rownum, cur.Id, cur.InDate CurInDate, cur.OutDate, nxt.InDate NxtInDate, DATEDIFF(day, cur.OutDate, nxt.InDate) Number_of_days,   
  CASE WHEN DATEDIFF(day, cur.OutDate, nxt.InDate)<11 AND DATEDIFF(day, cur.OutDate, nxt.InDate)>0 THEN nxt.OutDate ELSE cur.OutDate END AS OutDate2  
INTO #t4a  
FROM C cur  
LEFT OUTER JOIN C nxt ON (nxt.rownum=cur.rownum+1 AND nxt.Id=cur.Id)

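-- This temp table creates a dummy that identifies the OVERLAP of rows, so that those rows can be eliminated in a later temp table. This table does not work.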

WITH C AS (SELECT Id, InDate, OutDate, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY InDate) rownum FROM #t4a)  
SELECT cur.Id, cur.InDate, nxt.OutDate2,   
  CASE WHEN cur.OutDate2 < nxt.InDate THEN 1.0 ELSE 0.0
  END AS Overlap  
INTO #t4aa  
FROM C cur  
LEFT OUTER JOIN C nxt on (cur.rownum=nxt.rownum+1 AND cur.Id=nxt.Id)

2 answers:

Answer 0: (score: 1)

This is somewhat conceptual, but it may give you some ideas.

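WITH C AS 
(SELECT Id, InDate, OutDate, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY InDate) Rownum FROM #t4 t4) 

    -- Conceptual sketch, not a finished solution: unchanged periods and merged
    -- periods are combined with UNION, which already removes duplicate rows.

    -- rows after the first whose gap from the previous row is 11 days or more
    select Cgood.Id, Cgood.InDate, Cgood.OutDate
    from C 
    join C as Cgood 
      on Cgood.Id = C.Id 
     and Cgood.Rownum = C.Rownum + 1
     and DATEDIFF(day, C.OutDate, Cgood.InDate) >= 11
    union 
    -- the first row per Id, when the gap to the second row is 11 days or more
    select Cgood.Id, Cgood.InDate, Cgood.OutDate
    from C  
    join C as Cgood 
      on Cgood.Id = C.Id 
     and Cgood.Rownum = 1 
     and C.Rownum = 2 
     and DATEDIFF(day, Cgood.OutDate, C.InDate) >= 11
    union
    -- adjacent rows less than 11 days apart, merged into one period
    select cMerge.Id, C.InDate, cMerge.OutDate
    from C
    join C as cMerge 
      on cMerge.Id = C.Id 
     and cMerge.Rownum = C.Rownum + 1
     and DATEDIFF(day, C.OutDate, cMerge.InDate) < 11
    union
    -- the same merge written for rows 1 and 2 only (already covered by the branch above)
    select cMerge.Id, cMerge.InDate, C.OutDate
    from C
    join C as cMerge 
      on cMerge.Id = C.Id 
     and cMerge.Rownum = 1 
     and C.Rownum = 2
     and DATEDIFF(day, cMerge.OutDate, C.InDate) < 11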

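Answer 1: (score: 1)

I solved my own problem yesterday. I got rid of the last temp table and created the dummy variable in the first temp table instead. The core of the solution is to join both backward and forward.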

WITH C AS (SELECT Id, InDate, OutDate, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY InDate) Rownum FROM #t4 t4)  
SELECT cur.Rownum, cur.Id, cur.InDate CurInDate, cur.OutDate, nxt.InDate NxtInDate, DATEDIFF(day, cur.OutDate, nxt.InDate) Number_of_days,  
CASE  
WHEN DATEDIFF(day, prv.OutDate, cur.InDate)<11  
AND DATEDIFF(day, prv.OutDate, cur.InDate)>0  
THEN 1.0  
ELSE 0.0  
END AS Overlap,      
CASE  
WHEN DATEDIFF(day, cur.OutDate, nxt.InDate)<11  
AND DATEDIFF(day, cur.OutDate, nxt.InDate)>0  
THEN nxt.OutDate  
ELSE cur.OutDate  
END AS OutDate2  
INTO #t4a  
FROM C cur  
LEFT OUTER JOIN C prv ON (prv.rownum=cur.rownum-1 AND prv.Id=cur.Id)  
LEFT OUTER JOIN C nxt ON (nxt.rownum=cur.rownum+1 AND nxt.Id=cur.Id)
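
The answer stops at building #t4a. One possible final step is sketched below. It is an assumption rather than part of the original answer: it keeps the rows that were not absorbed into the previous row and reports OutDate2 as the end of the period.

```sql
-- Assumed follow-up step (not part of the original answer):
-- drop the rows flagged as absorbed by the previous row and
-- report the merged period using OutDate2.
SELECT Id,
       CurInDate AS InDate,
       OutDate2  AS OutDate
FROM #t4a
WHERE Overlap = 0
ORDER BY Id, CurInDate;
```

With the six sample rows from the question, this should return four periods: 2008-10-10 to 2009-05-13 and 2011-01-01 to 2011-05-20 for Id 1, and 2007-03-17 to 2008-10-19 and 2009-05-30 to 2010-12-31 for Id 2. Because each row only looks one row back and one row forward, a chain of three or more periods separated by small gaps would not be merged completely: the first row would take the OutDate of the second row only, while the later rows would be flagged and dropped.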