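I'm sure there is a more efficient way to do this, but I haven't found it. The code below takes over 2.5 hours of CPU time to process a little over one million records.
Sample input data: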
PATIENTID,admit_dt,discharge_dt,days_since_last_discharge,window_end_dt
55,35684,35688,NA,0
55,35693,35697,5,0
55,35719,35724,22,0
55,35738,35745,14,0
55,35758,35763,13,0
55,35798,35808,35,0
55,35817,35831,9,0
1564,31322,31339,NA,0
1564,31342,31350,3,0
1564,31353,31370,3,0
1564,31373,31438,3,0
1564,31439,31456,1,0
1564,31477,31480,NA,0
1564,31486,31489,6,0
1564,31499,31506,10,0
1564,31512,31522,6,0
1564,31525,31545,NA,0
1564,31547,31559,2,0
1564,31563,31568,4,0
1564,31606,31630,38,0
1564,31643,31653,13,0
1564,31656,31669,3,0
1564,31670,31680,1,0
1564,31685,31701,5,0
1564,31710,31713,9,0
1564,31724,31725,11,0
1564,31726,31733,1,0
1564,31753,31762,20,0
1564,31769,31770,7,0
1564,31807,31824,37,0
1564,31828,31831,4,0
1564,31981,31989,150,0
1564,32003,32008,14,0
The approach I tried, which works but is slow:
window_size <- 30
last_window_dt <- 0
for (row in seq_len(nrow(sample_df))) {
  if (is.na(sample_df[row, "days_since_last_discharge"])) {
    sample_df[row, "window_end_dt"] <- sample_df[row, "discharge_dt"] + window_size
    last_window_dt <- sample_df[row, "discharge_dt"] + window_size
  } else if (sample_df[row, "admit_dt"] <= last_window_dt) {
    sample_df[row, "window_end_dt"] <- last_window_dt
  } else {
    sample_df[row, "window_end_dt"] <- sample_df[row, "discharge_dt"] + window_size
    last_window_dt <- sample_df[row, "discharge_dt"] + window_size
  }
}
An alternative that actually takes even longer to run:
window_size <- 30
last_window_dt <- 0
for (row in seq_len(nrow(sample_df))) {
  ifelse(is.na(sample_df[row, "days_since_last_discharge"]) | sample_df[row, "admit_dt"] > last_window_dt,
    last_window_dt <- sample_df[row, "discharge_dt"] + window_size,
    last_window_dt
  )
  ifelse(is.na(sample_df[row, "days_since_last_discharge"]) | sample_df[row, "admit_dt"] > last_window_dt,
    sample_df[row, "window_end_dt"] <- sample_df[row, "discharge_dt"] + window_size,
    sample_df[row, "window_end_dt"] <- last_window_dt
  )
}
Desired output:
PATIENTID,admit_dt,discharge_dt,days_since_last_discharge,window_end_dt
55,35684,35688,NA,35718
55,35693,35697,5,35718
55,35719,35724,22,35754
55,35738,35745,14,35754
55,35758,35763,13,35793
55,35798,35808,35,35838
55,35817,35831,9,35838
1564,31322,31339,NA,31369
1564,31342,31350,3,31369
1564,31353,31370,3,31369
1564,31373,31438,3,31468
1564,31439,31456,1,31468
1564,31477,31480,NA,31510
1564,31486,31489,6,31510
1564,31499,31506,10,31510
1564,31512,31522,6,31552
1564,31525,31545,NA,31575
1564,31547,31559,2,31575
1564,31563,31568,4,31575
1564,31606,31630,38,31660
1564,31643,31653,13,31660
1564,31656,31669,3,31660
1564,31670,31680,1,31710
1564,31685,31701,5,31710
1564,31710,31713,9,31710
1564,31724,31725,11,31755
1564,31726,31733,1,31755
1564,31753,31762,20,31755
1564,31769,31770,7,31800
1564,31807,31824,37,31854
1564,31828,31831,4,31854
1564,31981,31989,150,32019
1564,32003,32008,14,32019
Answer 0 (score: 0)
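Getting rid of the for loop entirely and doing some of the work in vectorized form may help.
For example, instead of iterating over every row to find the NAs, you can just use
naMatches <- is.na(sample_df[, "days_since_last_discharge"])
This gives us a logical vector marking every row where the value is NA, so the next operation can be done for all of those rows at once.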
sample_df[naMatches, "window_end_dt"] <- sample_df[naMatches, "discharge_dt"] + window_size
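To make the two lines above concrete, here is a minimal sketch on a made-up three-row data frame. The name toy_df and its values are purely illustrative assumptions, not part of the real data; only the NA rows are touched and the other rows keep their placeholder 0.

```r
window_size <- 30

# toy_df is a hypothetical stand-in for sample_df (illustration only)
toy_df <- data.frame(
  discharge_dt = c(35688, 35697, 31339),
  days_since_last_discharge = c(NA, 5, NA),
  window_end_dt = 0
)

# Logical vector: TRUE wherever days_since_last_discharge is NA
naMatches <- is.na(toy_df[, "days_since_last_discharge"])

# One vectorized assignment covers every NA row at once
toy_df[naMatches, "window_end_dt"] <- toy_df[naMatches, "discharge_dt"] + window_size

print(toy_df$window_end_dt)
```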
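After that, the main trick seems to be how to compute last_window_dt without going row by row. It sounds like you just want to repeat a few whole-column operations a number of times. Maybe something like the following procedure would work?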
# Non-NA rows still hold the placeholder 0; mark them NA so they can be filled
sample_df$window_end_dt[!naMatches] <- NA
# First guess of window_end_dt: carry the value from the last NA row forward
sample_df$window_end_dt <- zoo::na.locf(sample_df$window_end_dt, na.rm = FALSE)
# Rows admitted inside the current guess take the previous row's window_end_dt
inside <- sample_df$admit_dt <= sample_df$window_end_dt
prev_end <- dplyr::lag(sample_df$window_end_dt, 1, default = sample_df$window_end_dt[1])
sample_df$window_end_dt[inside] <- prev_end[inside]
# Rows admitted after the current guess start a new window from their own discharge_dt
outside <- sample_df$admit_dt > sample_df$window_end_dt
sample_df$window_end_dt[outside] <- sample_df$discharge_dt[outside] + window_size
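I haven't thought about this too much, but if you change your process so that each step checks the whole frame at once, you could loop over those steps until the values in the window_end_dt column stop changing. If you end up needing more passes than you have rows, this will be useless.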
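As a rough sketch of the "repeat until nothing changes" idea, and not a tested solution for the full data set, something like the following could be tried. The function name window_end_fixed_point and its arguments are my own invention. It assumes the rows are already sorted by patient and admit date, and that the first row of every patient has NA in days_since_last_discharge (as in the sample), so that a window never carries over from one patient to the next.

```r
# Hypothetical helper: repeat a whole-column update until the set of rows
# that start a new window stops changing.
window_end_fixed_point <- function(admit, discharge, is_first, window_size = 30) {
  n <- length(admit)
  if (n == 0) return(numeric(0))
  candidate <- discharge + window_size   # window end if the row opens a window
  starts <- is_first                     # first guess: only NA rows open a window
  starts[1] <- TRUE
  for (iter in seq_len(n)) {             # never needs more passes than rows
    # Index of the most recent window-opening row at or before each row
    idx <- cummax(ifelse(starts, seq_len(n), 0L))
    window_end <- candidate[idx]
    # A row opens a window if it is flagged, or admitted after the previous row's window
    new_starts <- is_first | admit > c(-Inf, window_end[-n])
    new_starts[1] <- TRUE
    if (identical(new_starts, starts)) break
    starts <- new_starts
  }
  window_end
}

# Patient 55 from the sample data
admit     <- c(35684, 35693, 35719, 35738, 35758, 35798, 35817)
discharge <- c(35688, 35697, 35724, 35745, 35763, 35808, 35831)
is_first  <- c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
fp_result <- window_end_fixed_point(admit, discharge, is_first)
print(fp_result)
# 35718 35718 35754 35754 35793 35838 35838
```

Each pass settles at least one more row, so the worst case is one pass per row. Whether this beats a plain loop depends on how long the chains of overlapping windows are in the real data.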
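In summary, my answer is that for window_end_dt, as long as the process depends on previous values, i.e. each row depends in some dynamic way on the row before it, we will probably need some form of loop, and it is the process inside that loop that has to change to make it fast.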
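If a loop is unavoidable, one thing that might be worth trying is to keep the loop but run it over plain vectors rather than indexing into the data frame on every iteration, since repeated sample_df[row, col] reads and writes tend to be slow in R. The sketch below is only an illustration; compute_window_end is a made-up name, and it starts last_window at -Inf where the question uses 0, which gives the same result for positive dates.

```r
# Hypothetical helper: same row-by-row logic as the question,
# but operating on plain vectors.
compute_window_end <- function(admit, discharge, is_first, window_size = 30) {
  n <- length(admit)
  window_end <- numeric(n)
  last_window <- -Inf
  for (i in seq_len(n)) {
    # Open a new window on a flagged row or when admitted after the current window
    if (is_first[i] || admit[i] > last_window) {
      last_window <- discharge[i] + window_size
    }
    window_end[i] <- last_window
  }
  window_end
}

# Patient 55 from the sample data
admit     <- c(35684, 35693, 35719, 35738, 35758, 35798, 35817)
discharge <- c(35688, 35697, 35724, 35745, 35763, 35808, 35831)
is_first  <- c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
loop_result <- compute_window_end(admit, discharge, is_first)
print(loop_result)
# 35718 35718 35754 35754 35793 35838 35838
```

On the real data this would be called with sample_df$admit_dt, sample_df$discharge_dt and is.na(sample_df$days_since_last_discharge), and the result assigned back to sample_df$window_end_dt in one step.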
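Others may have cleverer answers, but if not, don't take the code I wrote as final. It may not do 100% of what you want, but hopefully it is a good starting point for you to experiment with and come up with a faster method.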