我有两组具有不同日期范围的数据。
Tbl 1:
ID, Date_Start, Date_End
1, 2010-01-01, 2010-01-09
1, 2010-01-10, 2010-01-19
1, 2010-01-30, 2010-01-31
Tbl 2:
ID, Date_Start, Date_End
1, 2010-01-01, 2010-01-04
1, 2010-01-08, 2010-01-17
1, 2010-01-30, 2010-01-31
我想查找案例日期范围与Tbl 2中的日期范围不完全重叠。例如,在本例中,我希望输出看起来像这样 -
Output:
ID, Gap_Start, Gap_End
1, 2010-01-05, 2010-01-07
1, 2010-01-18, 2010-01-19
日期范围永远不会在表格内重叠。为此,我使用的是DB2 SQL或SAS。不幸的是,数据集足够大(数百万条记录),我不能强迫它。
谢谢!
答案 0 :(得分:1)
我不认为所有案例都有效率和通用的解决方案。但是,在某些情况下,我们可以找出一些有效的方法。例如,下面假设:(1)数据集1和2具有相同顺序的id组; (2)可能的日期范围相对较短(此处假定为2010年的所有日期)。请注意,一个输入范围可能会产生两个间隙。
/* test data */
data one;
input id1 (start1 finish1) (:anydtdte.);
format start1 finish1 e8601da.;
cards;
1 2010-01-01 2010-01-09
1 2010-01-10 2010-01-19
1 2010-01-30 2010-01-31
2 2010-01-02 2010-01-10
;
run;
data two;
input id2 (start2 finish2) (:anydtdte.);
format start2 finish2 e8601da.;
cards;
1 2010-01-01 2010-01-04
1 2010-01-08 2010-01-17
1 2010-01-30 2010-01-31
2 2010-01-05 2010-01-06
;
run;
/* assumptions:
(1) datasets one and two have the same set of ids in the same
sorted order;
(2) only possible dates are in the year of 2010
*/
%let minDate = %sysevalf('01jan2010'd - 1);
%let maxDate = %sysevalf('31dec2010'd + 1);
data gaps;
array inRange[&minDate:&maxDate] _temporary_;
array covered[&minDate:&maxDate] _temporary_;
do i = &minDate to &maxDate; inRange[i] = 0; covered[i] = 0; end;
do until (last.id1);
set one;
by id1;
do i = start1 to finish1; inRange[i] = 1; end;
end;
do until (last.id2);
set two;
by id2;
do i = start2 to finish2; covered[i] = 1; end;
end;
format startGap finishGap e8601da.;
startGap = .;
finishGap = .;
do i = &minDate+1 to &maxDate;
if inRange[i] and not covered[i] and missing(startGap) then startGap = i;
if (covered[i] or not inRange[i]) and not missing(startGap) and not covered[i-1] then do;
finishGap = i - 1;
output;
call missing(startGap, finishGap);
keep id1 startGap finishGap;
end;
end;
run;
/* check */
proc print data=gaps noobs;
run;
/* on lst
id1 startGap finishGap
1 2010-01-05 2010-01-07
1 2010-01-18 2010-01-19
2 2010-01-02 2010-01-04
2 2010-01-07 2010-01-10
*/
答案 1 :(得分:1)
这不是一个完整的解决方案,因为它返回一个日期列表而不是范围,但它可能会有用:
SELECT
R1.ID, D.Date
FROM
#Ranges1 AS R1
INNER JOIN Dates AS D ON D.Date BETWEEN R1.StartDate AND R1.EndDate
EXCEPT
SELECT
R2.ID, D.Date
FROM
#Ranges2 AS R2
INNER JOIN Dates AS D ON D.Date BETWEEN R2.StartDate AND R2.EndDate
请注意,此解决方案需要一个日期表:一个表,每天有一条记录,适用于您可能使用的所有日期。它具有简洁,处理重叠日期范围的优点(在您的情况下不是必需的,但可能是下一个人)。
答案 2 :(得分:1)
继Jon of All Trades的方法之后,这是一个更完整的解决方案。关键特征是:
WITH
OutDates(ID, dt) AS
( SELECT Tbl1.ID, Calendar.dt FROM Calendar
INNER JOIN Tbl1 ON Calendar.dt BETWEEN Tbl1.Date_Start AND Tbl1.Date_End
LEFT OUTER JOIN Tbl2 ON Calendar.dt BETWEEN Tbl2.Date_Start AND Tbl2.Date_End
WHERE Tbl2.ID IS NULL
)
,
EarliestDates AS
( SELECT earliest.ID, earliest.dt FROM OutDates earliest
LEFT OUTER JOIN OutDates nonesuch_earlier ON DateAdd(day, -1, earliest.dt) = nonesuch_earlier.dt
WHERE nonesuch_earlier.ID IS NULL
)
,
LatestDates AS
( SELECT latest.ID, latest.dt FROM OutDates latest
LEFT OUTER JOIN OutDates nonesuch_later ON DATEADD(day, 1, latest.dt) = nonesuch_later.dt
WHERE nonesuch_later.ID IS NULL
)
SELECT rangestart.ID, rangestart.dt AS Gap_Start, rangeend.dt AS Gap_End
FROM EarliestDates rangestart JOIN LatestDates rangeend
ON rangestart.dt <= rangeend.dt
LEFT OUTER JOIN EarliestDates nonesuch_inner1
ON nonesuch_inner1.dt <= rangeend.dt AND nonesuch_inner1.dt > rangestart.dt
LEFT OUTER JOIN LatestDates nonesuch_inner2
ON nonesuch_inner2.dt >= rangestart.dt AND nonesuch_inner2.dt < rangeend.dt
WHERE nonesuch_inner1.dt IS NULL AND nonesuch_inner2.dt IS NULL
这是一个使用Sql Server语法处理公用表表达式的工作实现,但它应该很容易转换为DB2语法。我不知道它的规模如何,说实话,我只用一个非常小的数据集测试它。
答案 3 :(得分:0)
对于它的价值,这是我最终使用的方法。我认为你可以在纯SQL中做到这一点,但它变得非常丑陋且难以调试。
第1步 - 我合并了两个数据集中的日期范围。这意味着像
这样的东西ID, Start_Date, End_Date
1, 2010-01-01, 2010-01-31
1, 2010-02-01, 2010-02-28
变成了这个 -
ID, Start_Date, End_Date
1, 2010-01-01, 2010-02-28.
我以前用来产生这个问题的查询是 -
WITH Cte_recomb (Id, Start_date, End_date, Hopcount) AS
(SELECT Id,
Start_date,
End_date,
1 AS Hopcount
FROM Table1
UNION ALL
SELECT Cte_recomb.Id,
Cte_recomb.Start_date,
Table1.End_date,
(Recomb.Hopcount + 1) AS Hopcount
FROM Cte_recomb, Table1
WHERE (Cte_recomb.Id = Table1.Id) AND
(Cte_recomb.End_date + 1 day = Table1.Start_date)),
Cte_maxenddate AS
(SELECT Id,
Start_date,
Max (End_date) AS End_date
FROM Cte_recomb
GROUP BY Id, Start_date
ORDER BY Id, Start_date)
SELECT Maxend.*
FROM Cte_maxenddate AS Maxend
LEFT JOIN
Cte_recomb AS Nextrec
ON (Nextrec.Id = Maxend.Id) AND
(Nextrec.Start_date < Maxend.Start_date) AND
(Nextrec.End_date >= Maxend.End_date)
WHERE Nextrec.Id IS NULL;
第2步 -
我制作了另一个数据集,为两个数据集之间的每个重叠创建了一条记录。您需要一个额外的步骤来查找Table1中给定记录根本没有Table2中相应记录的情况。
SELECT Table1.Id,
Table1.Start_date AS Table1_start_date,
Table1.End_date AS Table1_end_date,
Table2.Start_date AS Table2_start_date,
Table2.End_date AS Table2_end_date
FROM Table1
INNER JOIN
Table2
ON (Table1.Plcy_id_sk = Id) AND
( (Table1.Start_date BETWEEN Table2.Start_date AND Table2.End_date) OR
(Table2.Start_date BETWEEN Table1.Start_date AND Table1.End_date)) AND
( (Table1.Start_date <> Table2.Start_date) OR
(Table1.End_date <> Table2.End_date))
ORDER BY Table1.Id, Table1.Start_date, Table2.Start_date;
第3步 -
我使用上面的数据集,并运行以下SAS作业。我尝试在纯SQL中使用递归查询来执行此操作,但每次查看它时都会变得更加丑陋和丑陋。
Data Table1_Gaps;
Set Table1_Compare;
By ID Table1_Start_Date Table2_Start_Date;
format Gap_Start_Date yymmdd10.;
format Gap_End_Date yymmdd10.;
format Old_Start_Date yymmdd10.;
format Old_End_Date yymmdd10.;
Retain Old_Start_Date Old_End_Date;
IF (Table2_End_Date = .) then do;
Gap_Start_Date = Table1_Start_Date;
Gap_End_Date = Table1_End_Date;
output;
end;
else do;
If (Table2_Start_Date > Table1_Start_Date) then do;
if first.Table1_Start_Date then do;
Gap_Start_Date = Table1_Start_Date;
Gap_End_Date = Table2_Start_Date - 1;
output;
end;
else do;
Gap_Start_Date = Old_End_Date + 1;
Gap_End_Date = Table2_Start_Date - 1;
output;
end;
end;
If (Table2_End_Date < Table1_End_Date) then do;
if Last.Table1_Start_Date then do;
Gap_Start_Date = Table2_End_Date + 1;
Gap_End_Date = Table1_End_Date;
output;
end;
end;
end;
Old_Start_Date = Table2_Start_Date;
Old_End_Date = Table2_End_Date;
drop Old_Start_Date Old_End_Date;
run;
我还没有完全验证它,但这种方法似乎确实给了我想要的结果。有什么想法吗?