Fill in missing date intervals by group in R

Time: 2020-01-21 11:46:44

Tags: r date data.table

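I have a large data set containing date periods with different disease statuses for each id and reference date. For each id, I would like to add a "healthy" status for all missing dates within +/- 5 years of the reference date.

I tried to adapt the solution from here: Fill in missing date ranges, but failed. Preferably, I would like to stay within the data.table framework. Any suggestions would be much appreciated!

Sample data: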

DT <- fread("
id  reference_date  period_start  period_end   Status
1   2010-01-10      2004-06-22    2005-03-15   1
1   2010-01-10      2008-10-11    2008-10-12   1
1   2010-01-10      2014-11-05    2016-01-03   2
2   2013-05-10      2012-02-01    2012-03-01   2
2   2014-06-11      2012-02-01    2012-03-01   2
3   2011-08-14      NA            NA           NA 
")

Desired output:

DT <- fread("
id  reference_date  period_start  period_end   Status
1   2010-01-10      2004-06-22    2005-03-15   1
1   2010-01-10      2005-03-16    2008-10-10   0   
1   2010-01-10      2008-10-11    2008-10-12   1
1   2010-01-10      2008-10-13    2014-11-04   0
1   2010-01-10      2014-11-05    2016-01-03   2
2   2013-05-10      2008-05-10    2012-01-31   0
2   2013-05-10      2012-02-01    2012-03-01   2
2   2013-05-10      2012-03-02    2018-05-10   0
2   2014-06-11      2009-06-11    2012-01-31   0
2   2014-06-11      2012-02-01    2012-03-01   2
2   2014-06-11      2012-03-02    2019-06-11   0
3   2011-08-14      2006-08-14    2016-08-14   0 
")

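Comment: For the first row, the +/- 5 year date interval runs from 2005-01-10 to 2015-01-10. However, because of the disease status that lasts until 2005-03-15, the "healthy" period starts on 2005-03-16. Since each id can have several reference dates, duplicate date periods will occur (as seen for id 2: 2012-02-01 to 2012-03-01), and that is fine. Finally, ids without any disease status are represented by NA (id 3).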
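
The window arithmetic described in the comment can be checked with a small base R sketch. The variable names (`window_start`, `window_end`, `healthy_start`) are illustrative only and do not appear in the original post; the dates are taken from the first row of the sample data.

```r
# Base R sketch of the +/- 5 year window for id 1 (dates from the sample data).
reference_date <- as.Date("2010-01-10")
disease_end    <- as.Date("2005-03-15")  # end of the first disease period

# seq() with a "years" step keeps the calendar day,
# unlike adding or subtracting 5 * 365 days.
window_start <- seq(reference_date, by = "-5 years", length.out = 2L)[2L]
window_end   <- seq(reference_date, by = "5 years",  length.out = 2L)[2L]

# The first healthy day is the day after the disease period ends.
healthy_start <- disease_end + 1L

print(c(window_start, window_end, healthy_start))
# [1] "2005-01-10" "2015-01-10" "2005-03-16"
```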

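Edit: I had some issues with my real data, so I adjusted the solution slightly; I also added the status, so that statuses are collapsed per date interval: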

 DT2 <- DT[,{

        # +/-5 years from t0
        sdt <- seq(reference_date, by="-5 years", length.out=2L)[2L]
        edt <- seq(reference_date, by="5 years", length.out=2L)[2L]

        if(is.na(period_start[1L])) {
          # replace NA with full time interval for 'healthy'
          .(period_start=sdt, period_end=edt, status='notsick')
        } else{
          # Add date for -5 years if it is the minimum, otherwise use existing minimum
          if (sdt < period_start[1L]) {
            period_start <- c(sdt, period_start)
          }
          # Add date for +5 years if it is the maximum, otherwise use existing maximum
          if (edt > period_end[.N]) {
            period_end <- c(period_end,edt)
          }
          # every start and every (end + 1) is a breakpoint between intervals
          dates <- unique(sort(c(period_start, period_end+1L)))
          # same column names as in the NA branch, so all groups line up
          .(period_start=dates[-length(dates)], period_end=dates[-1L]-1, status='')
        }
      },
      .(id,reference_date)]

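      ## (c). Collapse status for overlapping periods
      # Status (capital S) is the column of DT; a lower-case `status` would
      # resolve to the placeholder column of DT2 instead
      DT <- DT[DT2, on = .(id,reference_date, period_start <= period_start, period_end >= period_end), {
        status <- paste(Status, collapse = ";")
        .(status=status)},
        by = .EACHI, allow.cartesian = TRUE]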
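
To make the join condition above easier to follow, here is a base R sketch of the same idea: an interval receives the status of every original period that fully covers it, and multiple statuses are pasted together with ";". The data frames `periods` and `intervals` and their integer day numbers are made up for illustration and are not part of the original data.

```r
# Hypothetical disease periods (integer day numbers stand in for dates).
periods <- data.frame(start  = c(1L, 5L),
                      end    = c(10L, 7L),
                      status = c("1", "2"))

# Hypothetical non-overlapping intervals produced by splitting at breakpoints.
intervals <- data.frame(start = c(1L, 5L, 8L),
                        end   = c(4L, 7L, 10L))

# A period covers an interval when it starts no later and ends no earlier,
# which mirrors `period_start <= period_start, period_end >= period_end`.
intervals$status <- vapply(seq_len(nrow(intervals)), function(i) {
  covers <- periods$start <= intervals$start[i] &
            periods$end   >= intervals$end[i]
  paste(periods$status[covers], collapse = ";")
}, character(1))

print(intervals$status)
# [1] "1"   "1;2" "1"
```

An interval that no period covers would get an empty string here, which is where a "healthy" label could be substituted.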

1 answer:

Answer 0 (score: 1)

Here is one option:

interweave <- function(x, y) c(rbind(x, y)) #see ref
ans <- DT[, {
        sdt <- seq(reference_date, by="-5 years", length.out=2L)[2L]
        edt <- seq(reference_date, by="5 years", length.out=2L)[2L]

        if(is.na(period_start[1L])) {
            .(period_start=sdt, period_end=edt, Status=0L)
        } else {    
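            # interleave each start with (end + 1) while both vectors still have
            # the same length; rbind() would otherwise recycle the shorter one
            ps <- as.IDate(sort(interweave(period_start, period_end+1L)))
            # prepend the -5 year date only if it precedes the first period
            if (sdt < ps[1L]) {
                ps <- as.IDate(c(sdt, ps))
            }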

            if (period_end[.N] > edt) {
                ps <- ps[-length(ps)]
                pe <- period_end[.N]
            } else {
                pe <- edt
            }
            .(period_start=ps, period_end=c(ps[-1L] - 1, pe), Status=0L)
        }
    },
    .(id, reference_date)]
ans[DT, on=setdiff(names(DT), "Status"), Status := i.Status]
ans
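
As a rough illustration of how the breakpoints in the code above are built, the following sketch applies `interweave` to made-up integer day numbers instead of dates. The values and the names `starts`, `ends`, `window_end`, and `recycled` are assumptions for the example only. It also shows that `rbind` recycles the shorter argument, so `interweave` is best used on vectors of equal length.

```r
interweave <- function(x, y) c(rbind(x, y))

# Hypothetical disease periods as integer day numbers.
starts     <- c(10L, 30L)
ends       <- c(19L, 39L)
window_end <- 50L   # stands in for the +5 year date

# Each start and each (end + 1) marks the beginning of a new interval.
breaks <- sort(interweave(starts, ends + 1L))
print(breaks)
# [1] 10 20 30 40

# Every interval ends one day before the next one starts.
interval_end <- c(breaks[-1L] - 1L, window_end)
print(interval_end)
# [1] 19 29 39 50

# With unequal lengths, rbind() silently recycles the shorter vector,
# which duplicates a breakpoint.
recycled <- interweave(c(5L, 10L), 20L)
print(recycled)
# [1]  5 20 10 20
```

The resulting intervals alternate between disease periods (10 to 19, 30 to 39) and gaps (20 to 29, 40 to 50), which is the pattern the answer then labels with `Status`.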

Data:

library(data.table)
DT <- fread("
id  reference_date  period_start  period_end   Status
1   2010-01-10      2004-06-22    2005-03-15   1
1   2010-01-10      2008-10-11    2008-10-12   1
1   2010-01-10      2014-11-05    2016-01-03   2
2   2013-05-10      2012-02-01    2012-03-01   2
2   2014-06-11      2012-02-01    2012-03-01   2
3   2011-08-14      NA            NA           NA 
")
cols <- c("reference_date","period_start","period_end")
DT[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]

Reference: Alternate, interweave or interlace two vectors
