Question

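I have multiple rows that share the same ID, and each row has a date range. Sometimes these date ranges overlap. I need to identify the rows where they overlap.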

Example data set:

eg_data <- data.frame(
id = c(1,1,1,  2,2,  3,3,3,3,3,3,  4,4,  5,5,5,5),
start_dt = c("01/01/2016", "12/02/2016", "03/12/2017",  "02/01/2016", 
"08/12/2016",  "01/01/2016", "03/05/2016", "05/07/2016", "07/01/2016", 
"09/04/2016", "10/10/2016",  "01/01/2016", "05/28/2016",  "01/01/2016", 
"06/05/2016", "08/25/2016", "11/01/2016"),  
end_dt =   c("12/01/2016", "03/14/2017", "05/15/2017",  "05/15/2016", 
"12/29/2016",  "03/02/2016", "04/29/2016", "06/29/2016", "08/31/2016", 
"09/25/2016", "11/29/2016",  "05/31/2016", "08/19/2016",  "06/10/2016", 
"07/25/2016", "08/29/2016", "12/30/2016"))
eg_data$row_n <- 1:nrow(eg_data)

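Running the example data above, you'll see that the start date of row 3 overlaps with the end date of row 2 for ID #1; the start date of row 13 overlaps with the end date of row 12 for ID #4; and the start date of row 15 overlaps with the end date of row 14 for ID #5.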

I need to be able to identify when this type of overlap occurs within a single ID number.

Any help is appreciated. Thanks!

Answer 1

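First convert the dates to Date class. Then perform a self-join on id, with the intersection condition joining each row to every row it overlaps. overlap is 1 if the row has an overlap and 0 otherwise, and overlaps lists the row numbers that the row overlaps with. We used the row number rowid, but row_n could be substituted for each occurrence of rowid in the code below if desired.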

library(sqldf)

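# Convert the character date columns to Date class
fmt <- "%m/%d/%Y"
eg2 <- transform(eg_data, 
  start_dt = as.Date(start_dt, fmt),
  end_dt = as.Date(end_dt, fmt))

# Self-join on id; keep pairs of distinct rows whose ranges intersect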
sqldf("select 
    a.*, 
    count(b.rowid) > 0 as overlap, 
    coalesce(group_concat(b.rowid), '') as overlaps
  from eg2 a
  left join eg2 b on a.id = b.id and 
                     not a.rowid = b.rowid and
                     ((a.start_dt between b.start_dt and b.end_dt) or
                     (b.start_dt between a.start_dt and a.end_dt))
  group by a.rowid
  order by a.rowid")

giving:

   id   start_dt     end_dt row_n overlap overlaps
1   1 2016-01-01 2016-12-01     1       0         
2   1 2016-12-02 2017-03-14     2       1        3
3   1 2017-03-12 2017-05-15     3       1        2
4   2 2016-02-01 2016-05-15     4       0         
5   2 2016-08-12 2016-12-29     5       0         
6   3 2016-01-01 2016-03-02     6       0         
7   3 2016-03-05 2016-04-29     7       0         
8   3 2016-05-07 2016-06-29     8       0         
9   3 2016-07-01 2016-08-31     9       0         
10  3 2016-09-04 2016-09-25    10       0         
11  3 2016-10-10 2016-11-29    11       0         
12  4 2016-01-01 2016-05-31    12       1       13
13  4 2016-05-28 2016-08-19    13       1       12
14  5 2016-01-01 2016-06-10    14       1       15
15  5 2016-06-05 2016-07-25    15       1       14
16  5 2016-08-25 2016-08-29    16       0         
17  5 2016-11-01 2016-12-30    17       0
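
For readers who would rather avoid the sqldf dependency, the same self-join idea can be sketched in base R. This is only an illustration, not part of the original answer: find_overlaps is a hypothetical helper name, it assumes start_dt and end_dt are already Date columns with start_dt <= end_dt in every row, and it numbers rows itself in a row_n column. Under that assumption, the test start_dt.a <= end_dt.b & start_dt.b <= end_dt.a is equivalent to the two "between" conditions in the SQL above.

```r
# Hypothetical helper (not from the answer): flags rows whose date range
# intersects another row with the same id.
# Assumes df has columns id, start_dt, end_dt, with Date-class dates.
find_overlaps <- function(df) {
  df$row_n <- seq_len(nrow(df))

  # Self-join on id: every pair of rows that share an id.
  pairs <- merge(df, df, by = "id", suffixes = c(".a", ".b"))

  # Keep pairs of distinct rows whose closed intervals intersect.
  keep <- pairs$row_n.a != pairs$row_n.b &
    pairs$start_dt.a <= pairs$end_dt.b &
    pairs$start_dt.b <= pairs$end_dt.a
  hits <- pairs[keep, ]

  # For each row, collapse the row numbers it overlaps into one string
  # (an empty string when there are none).
  df$overlaps <- vapply(df$row_n, function(r) {
    paste(sort(hits$row_n.b[hits$row_n.a == r]), collapse = ",")
  }, character(1))
  df$overlap <- as.integer(nzchar(df$overlaps))
  df
}

# Small usage example built from the first five rows of the question's data.
toy <- data.frame(
  id = c(1, 1, 1, 2, 2),
  start_dt = as.Date(c("2016-01-01", "2016-12-02", "2017-03-12",
                       "2016-02-01", "2016-08-12")),
  end_dt = as.Date(c("2016-12-01", "2017-03-14", "2017-05-15",
                     "2016-05-15", "2016-12-29")))
res <- find_overlaps(toy)
print(res)
```

Calling find_overlaps(eg2) on the converted data from the answer should flag the same rows as the sqldf query. Note that merge builds every within-id pair, so this sketch may become slow when a single id has a very large number of rows.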
