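I am working on a function to identify gaps in a series of start/end dates. If a start date falls more than 1 day after all of the previous end dates (or there is no previous row), the output should be FALSE.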
DATA:
df <- data.frame('ID' = c('1','1','1','1','1','1'),
                 'start' = as.Date(c('2010-01-01', '2010-01-03', '2010-01-05',
                                     '2010-01-09', '2010-02-01', '2010-02-10')),
                 'end' = as.Date(c('2010-01-03', '2010-01-22', '2010-01-07',
                                   '2010-01-12', '2010-02-10', '2010-02-12')))
Desired output:
  ID      start        end continuous
1  1 2010-01-01 2010-01-03      FALSE
2  1 2010-01-03 2010-01-22       TRUE
3  1 2010-01-05 2010-01-07       TRUE
4  1 2010-01-09 2010-01-12       TRUE
5  1 2010-02-01 2010-02-10      FALSE
6  1 2010-02-10 2010-02-12       TRUE
The following code produces the desired result on this small data set:
df$continuous <-
  sapply(split(df, df$ID),
         function(x) {
           lapply(1:nrow(x),
                  function(y) {
                    any(x$start[y] - x$end[-(y:NROW(x$end))] <= 1)
                  })
         })
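As a diagnostic aid, here is a minimal sketch that runs the same nested call on the sample data but stores the result in a separate object (`res` and `flat` are names introduced only for this illustration). Because `lapply()` always returns a list, the value that ends up in `df$continuous` is built from list elements rather than being a plain logical vector. Whether this contributes to the wrong rows in the larger data set cannot be determined from the question alone.

```r
# Sample data from the question
df <- data.frame(
  ID    = c("1", "1", "1", "1", "1", "1"),
  start = as.Date(c("2010-01-01", "2010-01-03", "2010-01-05",
                    "2010-01-09", "2010-02-01", "2010-02-10")),
  end   = as.Date(c("2010-01-03", "2010-01-22", "2010-01-07",
                    "2010-01-12", "2010-02-10", "2010-02-12"))
)

# Same nested call as in the question, kept in a separate object
# instead of being assigned straight to df$continuous
res <- sapply(split(df, df$ID), function(x) {
  lapply(1:nrow(x), function(y) {
    any(x$start[y] - x$end[-(y:NROW(x$end))] <= 1)
  })
})

# Inspect the shape and type of the result before using it as a column
str(res)
print(is.list(res))  # the elements come from lapply(), so this is list-mode

# Flatten to a plain logical vector; this lines up with the rows of df
# only because the sample contains a single ID
flat <- unlist(res, use.names = FALSE)
print(flat)
```

With several IDs of different sizes, `sapply()` cannot simplify the result into one rectangular object, so checking `str(res)` on a small multi-ID subset of the real data may be a useful first step.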
However, when I apply it to a larger data set with many different IDs (> 100,000 observations), it produces wrong output. For example:
ID      start        end continuous
 2 2015-01-15 2015-01-15      FALSE
 2 2015-01-16 2015-01-17       TRUE
 2 2015-01-16 2015-01-17      FALSE  # wrong, should be TRUE
 2 2015-01-17 2015-01-19       TRUE
 2 2015-01-20 2015-01-22       TRUE
 2 2015-01-22 2015-01-23      FALSE  # wrong, should be TRUE
 2 2015-01-26 2015-01-26       TRUE
 2 2015-01-26 2015-01-30       TRUE
 2 2015-01-26 2015-01-26      FALSE  # wrong, should be TRUE
 2 2015-02-01 2015-02-06       TRUE  # wrong, should be FALSE
Does anyone know why?
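One possible cause, stated as an assumption because the larger data set is not shown: `split()` returns its pieces ordered by ID level, so if the rows are not sorted by ID, results computed per group come back in a different order than the rows of the data frame. The sketch below uses a small hypothetical data frame `d` with interleaved IDs to show that reordering, then computes the same rule with `ave()` and `cummax()`, which keeps the original row order. The helper `prev_max_end` is a name made up for this example. It relies on the observation that "some earlier end date is within 1 day of this start" is equivalent to comparing the start against the latest earlier end date, and it assumes there are no missing dates and that "previous" means earlier rows within the same ID.

```r
# Hypothetical two-ID data whose rows are NOT sorted by ID
d <- data.frame(
  ID    = c("2", "1", "2", "1"),
  start = as.Date(c("2015-01-15", "2010-01-01", "2015-01-16", "2010-02-01")),
  end   = as.Date(c("2015-01-15", "2010-01-03", "2015-01-17", "2010-02-10"))
)

# split() groups by ID level, so the row order of the pieces differs
# from the row order of d
split_order <- unlist(split(seq_len(nrow(d)), d$ID), use.names = FALSE)
print(split_order)

# Hypothetical helper: latest end date among the earlier rows of a group,
# NA for the first row of the group
prev_max_end <- function(end_num) {
  c(NA_real_, cummax(end_num))[seq_along(end_num)]
}

# ave() applies the helper within each ID and returns the values in the
# original row order of d
latest_prev_end <- ave(as.numeric(d$end), d$ID, FUN = prev_max_end)

# Days between each start and the latest earlier end of the same ID
gap <- as.numeric(d$start) - latest_prev_end

# TRUE when the start is at most 1 day after the latest earlier end;
# the first row of each ID has no earlier end and is set to FALSE
d$continuous <- !is.na(gap) & gap <= 1
print(d)
```

Since `cummax()` runs once per group, this sketch also avoids comparing every row against all earlier rows, which may matter with more than 100,000 observations.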