我有多行具有相同的ID,并且每一行都有日期范围。有时,这些日期范围重叠。我需要确定它们重叠的行。
EG数据集:
eg_data <- data.frame(
id = c(1,1,1, 2,2, 3,3,3,3,3,3, 4,4, 5,5,5,5),
start_dt = c("01/01/2016", "12/02/2016", "03/12/2017", "02/01/2016",
"08/12/2016", "01/01/2016", "03/05/2016", "05/07/2016", "07/01/2016",
"09/04/2016", "10/10/2016", "01/01/2016", "05/28/2016", "01/01/2016",
"06/05/2016", "08/25/2016", "11/01/2016"),
end_dt = c("12/01/2016", "03/14/2017", "05/15/2017", "05/15/2016",
"12/29/2016", "03/02/2016", "04/29/2016", "06/29/2016", "08/31/2016",
"09/25/2016", "11/29/2016", "05/31/2016", "08/19/2016", "06/10/2016",
"07/25/2016", "08/29/2016", "12/30/2016"))
eg_data$row_n <- 1:nrow(eg_data)
运行上面的eg数据,您会看到
第3行的开始日期与ID#1的第2行的结束日期重叠;第13行的开始日期与ID#4的第12行的结束日期重叠;并且第15行的开始日期与ID#5的第14行的结束日期重叠。
我需要能够针对单个ID号识别何时发生这种类型的重叠。
感谢您的帮助。谢谢!
答案 0 :(得分:0)
首先将日期转换为Date
类。然后在id
上进行自我联接,并且相交条件将联接所有相关的重叠行。如果该行具有重叠,则overlap
为1,否则为0。 overlaps
列出了该行重叠的行号。我们使用了行号rowid
,但是如果需要的话,我们可以在下面的代码中将每次出现的行替换为row_n
。
library(sqldf)
fmt <- "%m/%d/%Y"
eg2 <- transform(eg_data,
start_dt = as.Date(start_dt, fmt),
end_dt = as.Date(end_dt, fmt))
sqldf("select
a.*,
count(b.rowid) > 0 as overlap,
coalesce(group_concat(b.rowid), '') as overlaps
from eg2 a
left join eg2 b on a.id = b.id and
not a.rowid = b.rowid and
((a.start_dt between b.start_dt and b.end_dt) or
(b.start_dt between a.start_dt and a.end_dt))
group by a.rowid
order by a.rowid")
给予:
id start_dt end_dt row_n overlap overlaps
1 1 2016-01-01 2016-12-01 1 0
2 1 2016-12-02 2017-03-14 2 1 3
3 1 2017-03-12 2017-05-15 3 1 2
4 2 2016-02-01 2016-05-15 4 0
5 2 2016-08-12 2016-12-29 5 0
6 3 2016-01-01 2016-03-02 6 0
7 3 2016-03-05 2016-04-29 7 0
8 3 2016-05-07 2016-06-29 8 0
9 3 2016-07-01 2016-08-31 9 0
10 3 2016-09-04 2016-09-25 10 0
11 3 2016-10-10 2016-11-29 11 0
12 4 2016-01-01 2016-05-31 12 1 13
13 4 2016-05-28 2016-08-19 13 1 12
14 5 2016-01-01 2016-06-10 14 1 15
15 5 2016-06-05 2016-07-25 15 1 14
16 5 2016-08-25 2016-08-29 16 0
17 5 2016-11-01 2016-12-30 17 0