Determining time gaps in big data

Time: 2017-08-15 09:44:57

Tags: r time apply intervals

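I'm working on a function that detects gaps in a series of start/end dates. For each row, the output should be FALSE if the start date is more than 1 day after every previous end date (that is, more than 1 day after the latest end date among the earlier rows), and TRUE otherwise. The first row of each ID has no previous end date, so it is FALSE.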

DATA:

df <- data.frame('ID' = c('1','1','1','1','1','1'), 'start' = as.Date(c('2010-01-01', '2010-01-03', '2010-01-05', '2010-01-09','2010-02-01', '2010-02-10')),
                 'end' = as.Date(c('2010-01-03', '2010-01-22', '2010-01-07', '2010-01-12', '2010-02-10', '2010-02-12')))

Expected output:

  ID      start        end  continuous
1  1 2010-01-01 2010-01-03 FALSE
2  1 2010-01-03 2010-01-22 TRUE
3  1 2010-01-05 2010-01-07 TRUE
4  1 2010-01-09 2010-01-12 TRUE
5  1 2010-02-01 2010-02-10 FALSE
6  1 2010-02-10 2010-02-12 TRUE 
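As a sanity check of the rule, the expected column can be reproduced for this single ID by comparing each start date with the running maximum of the earlier end dates. This is only a minimal sketch: it assumes "previous" means "earlier rows in the data frame", and the names `prev_max_end` and `continuous` are made up for illustration. The dates are converted to numeric day counts because `cummax()` is not defined for `Date` objects.

```r
# Example data from the question.
df <- data.frame(
  ID    = c('1', '1', '1', '1', '1', '1'),
  start = as.Date(c('2010-01-01', '2010-01-03', '2010-01-05',
                    '2010-01-09', '2010-02-01', '2010-02-10')),
  end   = as.Date(c('2010-01-03', '2010-01-22', '2010-01-07',
                    '2010-01-12', '2010-02-10', '2010-02-12'))
)

# Latest end date among the earlier rows, as a day count.
# The first row has no earlier rows, so it gets -Inf.
prev_max_end <- c(-Inf, head(cummax(as.numeric(df$end)), -1))

# TRUE when the start is at most 1 day after the latest earlier end.
continuous <- as.numeric(df$start) - prev_max_end <= 1
print(continuous)
```

Comparing against the running maximum is equivalent to testing `any(start - previous_ends <= 1)`, since the smallest difference is always the one to the latest previous end date.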

This code produces the desired result on this small data set:

df$continuous <-
  sapply(split(df, df$ID),
                function(x) {
                  lapply(1:nrow(x),
                         function(y) {
                           any(x$start[y] - x$end[-(y:NROW(x$end))] <= 1)
                         })
                })

However, when I apply it to a larger data set (> 100,000 observations) with many different IDs, it produces wrong output. For example:

 ID         start       end            continuous
 2    2015-01-15   2015-01-15             FALSE
 2    2015-01-16   2015-01-17             TRUE
 2    2015-01-16   2015-01-17            FALSE #wrong, should be TRUE
 2    2015-01-17   2015-01-19             TRUE
 2    2015-01-20   2015-01-22             TRUE
 2    2015-01-22   2015-01-23            FALSE #wrong, should be TRUE
 2    2015-01-26   2015-01-26             TRUE
 2    2015-01-26   2015-01-30             TRUE
 2    2015-01-26   2015-01-26            FALSE #wrong, should be TRUE
 2    2015-02-01   2015-02-06             TRUE #wrong, should be FALSE

Does anyone know why?
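One possible explanation, which cannot be confirmed from the excerpt alone, is row misalignment: `split()` returns the groups ordered by ID, so the combined result comes back in group order rather than in the original row order, and `sapply()` may additionally simplify the nested lists into a matrix or a list of lists. If the larger data frame is not sorted by ID, assigning that result straight to `df$continuous` would attach values to the wrong rows. The sketch below writes each group's result back through its row indices instead. The helpers `is_continuous` and `flag_continuous` and the `demo` data are hypothetical, and rows within each ID are assumed to already be in the intended order.

```r
# Hypothetical helper: TRUE where a start is at most 1 day after the
# latest end date among the earlier rows of the same group.
is_continuous <- function(start, end) {
  prev_max_end <- c(-Inf, head(cummax(as.numeric(end)), -1))
  as.numeric(start) - prev_max_end <= 1
}

# Hypothetical wrapper: evaluate each ID separately and write the results
# back through row indices, so the original row order is preserved.
flag_continuous <- function(df) {
  out <- logical(nrow(df))
  for (rows in split(seq_len(nrow(df)), df$ID)) {
    out[rows] <- is_continuous(df$start[rows], df$end[rows])
  }
  out
}

# Small made-up example with interleaved IDs.
demo <- data.frame(
  ID    = c('2', '1', '2', '1'),
  start = as.Date(c('2015-01-15', '2010-01-01', '2015-01-16', '2010-02-01')),
  end   = as.Date(c('2015-01-15', '2010-01-03', '2015-01-17', '2010-02-10'))
)
demo$continuous <- flag_continuous(demo)
print(demo)
```

Because the result is a plain logical vector with one element per row, it can be assigned to a column without relying on how `sapply()` simplifies its output.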

0 answers:

No answers yet