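I have a large dataset containing date periods with different disease states for each id and reference date. For each id, I want to add a "healthy" state for all missing dates within +/- 5 years of the reference date.
I tried adapting the solution from "Fill in missing date ranges", but failed. Preferably, I would like to stay within the data.table framework. Any suggestions would be much appreciated!
Sample data: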
DT <- fread("
id reference_date period_start period_end Status
1 2010-01-10 2004-06-22 2005-03-15 1
1 2010-01-10 2008-10-11 2008-10-12 1
1 2010-01-10 2014-11-05 2016-01-03 2
2 2013-05-10 2012-02-01 2012-03-01 2
2 2014-06-11 2012-02-01 2012-03-01 2
3 2011-08-14 NA NA NA
")
Desired output:
DT <- fread("
id reference_date period_start period_end Status
1 2010-01-10 2004-06-22 2005-03-15 1
1 2010-01-10 2005-03-16 2008-10-10 0
1 2010-01-10 2008-10-11 2008-10-12 1
1 2010-01-10 2008-10-13 2014-11-04 0
1 2010-01-10 2014-11-05 2016-01-03 2
2 2013-05-10 2008-05-10 2012-01-31 0
2 2013-05-10 2012-02-01 2012-03-01 2
2 2013-05-10 2012-03-02 2018-05-10 0
2 2014-06-11 2009-06-11 2012-01-31 0
2 2014-06-11 2012-02-01 2012-03-01 2
2 2014-06-11 2012-03-02 2019-06-11 0
3 2011-08-14 2006-08-14 2016-08-14 0
")
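Comments: For the first row, the +/- 5 year date interval runs from 2005-01-10 to 2015-01-10. However, because the disease state lasts until 2005-03-15, the "healthy" period starts on 2005-03-16. Since each id can have several reference dates, duplicate date periods will occur (as seen for id 2: 2012-02-01 to 2012-03-01), and that is fine. Finally, ids without any disease state are represented by NA (id 3).
Edit: I ran into some problems with my real data, so I adjusted the solution a bit; I also added the status, so that statuses are collapsed per date interval: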
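The window arithmetic described in the comment above can be sketched in a few lines. This is only an illustration, not part of the original post: `window_bounds` is a hypothetical helper name, and it relies on base R `seq()` with a `"years"` step, the same idiom used in the code in this thread.

```r
# Hypothetical helper (not from the original post):
# returns the +/- n-year bounds around a reference date.
window_bounds <- function(reference_date, years = 5L) {
  ref <- as.Date(reference_date)
  # a "years" step shifts the calendar year and keeps month and day
  start <- seq(ref, by = paste0("-", years, " years"), length.out = 2L)[2L]
  end   <- seq(ref, by = paste0(years, " years"), length.out = 2L)[2L]
  list(start = start, end = end)
}

w <- window_bounds("2010-01-10")

# A disease period that ends inside the window pushes the first
# healthy day to the day after that period ends.
first_healthy <- max(w$start, as.Date("2005-03-15") + 1L)

format(c(w$start, w$end, first_healthy))
```

For the first row of the sample data this gives the bounds quoted above and 2005-03-16 as the first healthy day.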
DT2 <- DT[, {
    # +/-5 years from t0
    sdt <- seq(reference_date, by="-5 years", length.out=2L)[2L]
    edt <- seq(reference_date, by="5 years", length.out=2L)[2L]
    if (is.na(period_start[1L])) {
        # replace NA with full time interval for 'healthy'
        .(period_start=sdt, period_end=edt, status='notsick')
    } else {
        # Add date for -5 years if it is the minimum, otherwise use existing minimum
        if (sdt < period_start[1L]) {
            period_start <- c(sdt, period_start)
        }
        # Add date for +5 years if it is the maximum, otherwise use existing maximum
        if (edt > period_end[.N]) {
            period_end <- c(period_end, edt)
        }
        # every period start and every day after a period end is a cut point
        dates <- unique(sort(c(period_start, period_end+1L)))
        # use the same column names as the NA branch so all groups stack consistently
        .(period_start=dates[-length(dates)], period_end=dates[-1L]-1L, status='')
    }
},
.(id, reference_date)]
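## (c). Collapse status for overlapping periods
# for each interval in DT2, collect the Status of every disease period in DT that covers it
# (the column is named Status in the sample data)
DT <- DT[DT2, on = .(id, reference_date, period_start <= period_start, period_end >= period_end),
    .(status = paste(Status, collapse = ";")),
    by = .EACHI, allow.cartesian = TRUE]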
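To make the collapsing step easier to follow, here is a small self-contained sketch of a non-equi join with `by = .EACHI` on made-up data. The tables `disease` and `slices` and the columns `s` and `e` are hypothetical names used only for this illustration; they stand in for `DT` and `DT2`.

```r
library(data.table)

# Made-up data: two overlapping disease periods for one id
disease <- data.table(
  id     = 1L,
  s      = as.IDate(c("2010-01-01", "2010-01-05")),
  e      = as.IDate(c("2010-01-31", "2010-01-20")),
  Status = c(1L, 2L)
)

# Non-overlapping slices obtained by cutting at every start and every end + 1
slices <- data.table(
  id = 1L,
  s  = as.IDate(c("2010-01-01", "2010-01-05", "2010-01-21")),
  e  = as.IDate(c("2010-01-04", "2010-01-20", "2010-01-31"))
)

# For each slice (one row of i), find the disease periods that fully cover it
# (x.s <= i.s and x.e >= i.e) and paste their statuses together.
res <- disease[slices,
               on = .(id, s <= s, e >= e),
               .(status = paste(Status, collapse = ";")),
               by = .EACHI, allow.cartesian = TRUE]

res$status
```

The middle slice is covered by both disease periods, so its status contains both codes; the outer slices are covered only by the first period.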
Answer 0 (score: 1)
Here is one option:
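interweave <- function(x, y) c(rbind(x, y)) #see ref
ans <- DT[, {
    # +/- 5 year window around the reference date
    sdt <- seq(reference_date, by="-5 years", length.out=2L)[2L]
    edt <- seq(reference_date, by="5 years", length.out=2L)[2L]
    if (is.na(period_start[1L])) {
        # no disease periods: the whole window is healthy
        .(period_start=sdt, period_end=edt, Status=0L)
    } else {
        # alternate period starts with the day after each period end;
        # both vectors have equal length here, so rbind() recycles nothing
        ps <- interweave(period_start, period_end+1L)
        if (sdt < period_start[1L]) {
            # the window opens before the first disease period
            ps <- c(unclass(sdt), ps)
        }
        ps <- as.IDate(sort(ps))
        if (period_end[.N] > edt) {
            # the last disease period runs past the window: no trailing healthy period
            ps <- ps[-length(ps)]
            pe <- period_end[.N]
        } else {
            pe <- edt
        }
        .(period_start=ps, period_end=c(ps[-1L] - 1L, pe), Status=0L)
    }
},
.(id, reference_date)]
# update join: copy the known Status onto the matching disease periods
ans[DT, on=setdiff(names(DT), "Status"), Status := i.Status]
ans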
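The last step, `ans[DT, on=..., Status := i.Status]`, is an update join: rows of `ans` that match a row of `DT` on the join columns get their `Status` overwritten by reference, and all other rows keep the default 0. A minimal sketch on made-up data, where `filled` and `known` are hypothetical stand-ins for `ans` and `DT`:

```r
library(data.table)

# Made-up data: `filled` plays the role of `ans`, `known` the role of `DT`
filled <- data.table(
  id           = c(1L, 1L, 1L),
  period_start = as.IDate(c("2005-01-10", "2005-03-16", "2008-10-11")),
  Status       = c(0L, 0L, 0L)
)
known <- data.table(
  id           = 1L,
  period_start = as.IDate("2008-10-11"),
  Status       = 1L
)

# Update join: for rows of `filled` that match `known` on the join columns,
# overwrite Status in place with the value from `known` (the i. prefix).
filled[known, on = .(id, period_start), Status := i.Status]

filled$Status
```

Only the third row matches, so only its `Status` changes; the unmatched rows stay at 0.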
Data:
library(data.table)
DT <- fread("
id reference_date period_start period_end Status
1 2010-01-10 2004-06-22 2005-03-15 1
1 2010-01-10 2008-10-11 2008-10-12 1
1 2010-01-10 2014-11-05 2016-01-03 2
2 2013-05-10 2012-02-01 2012-03-01 2
2 2014-06-11 2012-02-01 2012-03-01 2
3 2011-08-14 NA NA NA
")
cols <- c("reference_date","period_start","period_end")
DT[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]