拥有药物时间序列数据,并对此类结构的多个受试者进行重复评估(使用data.table):
library(data.table)
dt1 = setDT(structure(list(id = c("G", "G", "G", "G", "M", "M", "M", "M",
"M", "M", "M"), med = c("mult", "R", "mult", "R", "A", "mult",
"A", "C", "A", "Q", "A"), strt = c(19059L, 19061L, 19065L, 19066L,
19136L, 19138L, 19142L, 19142L, 19155L, 19246L, 19257L), end = c(19061L,
19065L, 19066L, 19101L, 19138L, 19139L, 19172L, 19172L, 19255L,
19276L, 19287L)), .Names = c("id", "med", "strt", "end"), row.names = c(NA,
-11L), class = "data.frame"))
生成data.table dt1
:
id med strt end
1: G mult 19059 19061
2: G R 19061 19065
3: G mult 19065 19066
4: G R 19066 19101
5: M A 19136 19138
6: M mult 19138 19139
7: M A 19142 19172
8: M C 19142 19172
9: M A 19155 19255
10: M Q 19246 19276
11: M A 19257 19287
我正在尝试重新组织数据,以便对于每个受试者,患者所在的任何一天> 1 med被记录为'mult'
,并且给定药物治疗方案的连续天数表示为单个行。
因此,期望的结果是dt2
:
id med strt end
1: G mult 19059 19061
2: G R 19062 19064
3: G mult 19065 19066
4: G R 19067 19101
5: M A 19136 19137
6: M mult 19138 19139
7: M mult 19142 19172
8: M A 19173 19245
9: M mult 19246 19255
10: M Q 19256 19256
11: M mult 19257 19276
12: M A 19277 19287
我编写了以下代码来执行此操作,但它很慢而且冗长。有人可以帮我改进吗?
dt2 = dt1[, list(id, med, day=seq(strt,end)), by=1:nrow(dt1)]
setkey(dt2,'id','day')
dt2[, med := ifelse(length(unique(med))>1, 'mult', med), by=list(id,day)]
dt2 = unique(dt2)
medrun <- function(y,z){
cnt = grp = 1L
lx = length(y)
ne = y[-lx] != y[-1L]
n1 = z[-lx] - z[-1L] != -1
for(i in seq_along(ne)){if(ne[i] | n1[i])cnt=cnt+1; grp[i+1]=cnt}
grp
}
dt2[,grp := as.numeric(medrun(med,day)), by=id]
setkey(dt2,'id','grp')
dt2[,strt := min(day), by=list(id,grp)]
dt2[,end := max(day), by=list(id,grp)]
dt2 = unique(dt2)
dt2 = subset(dt2, select = c('id','med','strt','end'))
数据集很大(> 3M行),因此解决方案需要内存高效且快速。理想情况下,不希望将间隔扩展为1个/天。
答案 0 :(得分:1)
我会按天扩展,设置密钥,然后相应地计算
DT_meds <- DT1[, list(day = if (.N ==1) seq(from=strt, to=end) else unlist(mapply(seq, strt, end))), keyby=list(id, med)]
setkey(DT_meds, id, day, med)
DT_meds[, med := if (length(unique(med)) > 1) "mult" else med, by=list(id, day)]
DT_meds[, grp := cumsum (c(FALSE, diff(day) > 1)), by=list(id, med) ]
DT_results <- DT_meds[, list(str=day[1L], end=day[.N]), by=list(id, med, grp)]
DT_results[, grp := NULL]
DT_results
# id med str end
# 1: G mult 19059 19061
# 2: G R 19062 19064
# 3: G mult 19065 19066
# 4: G R 19067 19101
# 5: M A 19136 19137
# 6: M mult 19138 19139
# 7: M mult 19142 19172
# 8: M A 19173 19245
# 9: M mult 19246 19255
# 10: M Q 19256 19256
# 11: M mult 19257 19276
# 12: M A 19277 19287