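This is actually an adaptation of another question I posted previously. I hope other users can give feedback on my code or suggest better alternatives. Thanks!
I have a dataset of lab tests, and I only want the lab tests that fall 365 days (+/- 90 days) after the first lab test, then 365 days (+/- 90 days) after that one, and so on.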
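For example, only the first lab test is selected for PATIENT_ID == 1, because none of the later tests fall within the window. For PATIENT_ID == 2, 30/05/2016 and 01/08/2016 are both within 365 +/- 90 days of the first date, but only the latter is selected because it is closer to the 365-day mark. The third date, 27/07/2017, is selected because it is within 365 +/- 90 days of the second selected date, and so on.
Data: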
    PATIENT_ID LAB_TEST_DATE LAB_TEST
 1:          1    2012-11-19       31
 2:          1    2012-11-21       30
 3:          1    2012-11-23       31
 4:          1    2012-11-26       30
 5:          1    2012-11-28       30
 6:          1    2012-12-01       30
 7:          1    2012-12-05       29
 8:          1    2012-12-06       30
 9:          2    2015-07-23       43
10:          2    2015-08-05       41
11:          2    2015-08-19       44
12:          2    2015-09-02       41
13:          2    2015-09-30       40
14:          2    2015-12-23       45
15:          2    2016-03-16       46
16:          2    2016-05-30       40
17:          2    2016-08-01       46
18:          2    2017-07-27       44
19:          2    2018-10-15       49
20:          3    2011-08-11       30
...trunc...
Expected output:
PATIENT_ID LAB_TEST_DATE LAB_TEST
         1    19/11/2012       31
         2    23/07/2015       43
         2    01/08/2016       46
         2    27/07/2017       44
         2    15/10/2018       49
         3    11/08/2011       30
         3    13/08/2012       36
         4    01/10/2014       41
         4    26/08/2015       42
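To make the selection rule concrete, the first step for PATIENT_ID == 2 can be checked by hand. This is only a small sketch in base R using three dates from the data above; the variable names are made up for illustration.

```r
# Dates for PATIENT_ID == 2, taken from the data above
first_date <- as.Date("2015-07-23")
candidates <- as.Date(c("2016-05-30", "2016-08-01"))

# Days elapsed since the first lab test
days_since <- as.numeric(candidates - first_date)

# TRUE for candidates inside the 365 +/- 90 day window (275 to 455 days)
in_window <- days_since >= 365 - 90 & days_since <= 365 + 90

# Among the candidates in the window, keep the one closest to the 365-day mark
chosen <- candidates[in_window][which.min(abs(days_since[in_window] - 365))]
chosen
```

Both candidates are in the window (312 and 375 days), and 01/08/2016 is kept because it is only 10 days away from the 365-day mark.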
Input data:
df <- structure(list(PATIENT_ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L),
LAB_TEST_DATE = structure(c(15663, 15665, 15667, 15670, 15672,
15675, 15679, 15680, 16639, 16652, 16666, 16680, 16708, 16792,
16876, 16951, 17014, 17374, 17819, 15197, 15202, 15217, 15300,
15335, 15357, 15405, 15413, 15434, 15453, 15565, 16344, 16352,
16364, 16379, 16414, 16442, 16505, 16589, 16673), class = "Date"),
LAB_TEST = c(31L, 30L, 31L, 30L, 30L, 30L, 29L, 30L, 43L,
41L, 44L, 41L, 40L, 45L, 46L, 40L, 46L, 44L, 49L, 30L, 31L,
34L, 34L, 36L, 36L, 33L, 36L, 33L, 35L, 36L, 41L, 43L, 43L,
40L, 39L, 42L, 40L, 40L, 42L)), class = "data.frame", .Names = c("PATIENT_ID",
"LAB_TEST_DATE", "LAB_TEST"), row.names = c(NA, -39L))
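Code:
I wrote a recursive function: starting from the first date, if a later date is within the range and closest to the 365-day mark, I keep that date and repeat from there.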
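library(dplyr)

# Recursively collect the indices of the lab tests to keep, starting at `ind`
f <- function(d, ind = 1) {
  # days between every test and the current reference test
  datediff <- difftime(d, d[ind], units = "days")
  # tests inside the 365 +/- 90 day window
  ind.range <- which(datediff >= 275 & datediff <= 455)
  # test closest to the 365-day mark
  ind.min <- which.min(abs(datediff - 365))
  # NA when the closest test lies outside the window
  ind.next <- first(intersect(ind.range, ind.min))
  if (is.na(ind.next))
    return(ind)
  else
    return(c(ind, f(d, ind.next)))
}

df %>% group_by(PATIENT_ID) %>% slice(f(LAB_TEST_DATE))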
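One possible alternative is to write the same rule as a loop instead of a recursion. The sketch below is not taken from the original post: select_yearly, target and tol are hypothetical names, and it uses base R only. It is meant to return the same indices as f for a vector of dates.

```r
# Hypothetical helper: iterative version of the same selection rule
select_yearly <- function(d, target = 365, tol = 90) {
  days <- as.numeric(d)   # Dates as day counts
  keep <- 1L              # the first lab test is always kept
  repeat {
    # days elapsed since the most recently selected test
    elapsed <- days - days[keep[length(keep)]]
    # tests inside the target +/- tol window
    in_window <- which(abs(elapsed - target) <= tol)
    if (length(in_window) == 0L) break
    # keep the test in the window closest to the target
    nxt <- in_window[which.min(abs(elapsed[in_window] - target))]
    keep <- c(keep, nxt)
  }
  keep
}

# Dates for PATIENT_ID == 2 from the data above
d <- as.Date(c("2015-07-23", "2015-08-05", "2015-08-19", "2015-09-02",
               "2015-09-30", "2015-12-23", "2016-03-16", "2016-05-30",
               "2016-08-01", "2017-07-27", "2018-10-15"))
d[select_yearly(d)]
```

For these dates the selected rows are 2015-07-23, 2016-08-01, 2017-07-27 and 2018-10-15, which matches the expected output. With dplyr loaded, it should drop into the same pipeline in place of f, that is slice(select_yearly(LAB_TEST_DATE)).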
Answer 0 (score: 0)
Here is a solution using data.table. Explanations are inline.
library(data.table)
setDT(df)
#extract the first visit for each patient
firstDates <- df[, .SD[1L], by=PATIENT_ID]
#create the period for each lab test
df[, ':=' (STARTDATE=LAB_TEST_DATE+365-90, ENDDATE=LAB_TEST_DATE+365+90)]
#for each lab test, find the lab tests that are within 365 +/- 90 days
#after that lab test by performing a non-equi self join
withinPeriod <- df[
df,
.(PATIENT_ID, x.LAB_TEST, x.LAB_TEST_DATE, i.LAB_TEST_DATE, i.STARTDATE, i.ENDDATE, i.LAB_TEST),
by=.EACHI,
on=.(PATIENT_ID, LAB_TEST_DATE >= STARTDATE, LAB_TEST_DATE <= ENDDATE)][
!is.na(x.LAB_TEST), -3L:-1L]
#find the lab test that is closest to the 365 days after that lab test
#and extract only relevant columns
selected <- withinPeriod[, .SD[which.min(abs(i.LAB_TEST_DATE + 365 - x.LAB_TEST_DATE))],
by=.(PATIENT_ID, i.LAB_TEST_DATE, i.STARTDATE, i.ENDDATE, i.LAB_TEST)][,
.(PATIENT_ID, LAB_TEST_DATE=x.LAB_TEST_DATE, LAB_TEST=x.LAB_TEST)]
#row-bind the first dates with those selected
ans <- rbindlist(list(firstDates, unique(selected)), use.names=TRUE)
setorder(ans, PATIENT_ID, LAB_TEST_DATE)
ans
#    PATIENT_ID LAB_TEST_DATE LAB_TEST
# 1:          1    2012-11-19       31
# 2:          2    2015-07-23       43
# 3:          2    2016-08-01       46
# 4:          2    2017-07-27       44
# 5:          2    2018-10-15       49
# 6:          3    2011-08-11       30
# 7:          3    2012-08-13       36
# 8:          4    2014-10-01       41
# 9:          4    2015-08-26       42
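One thing that may be worth checking before reusing this on other data: as far as I can tell, the self join opens a window from every lab test, not only from the tests that were already selected. On the sample data this gives the expected output, but the two rules need not agree in general. The sketch below uses made-up dates for a single hypothetical patient and plain base R (not the data.table code above) to compare the two rules.

```r
# Made-up dates for a single hypothetical patient
d <- as.Date("2020-01-01") + c(0, 100, 365, 465)
days <- as.numeric(d)

# Per-test windows: for every test, the closest test 365 +/- 90 days later
per_test <- sapply(days, function(start) {
  elapsed <- days - start
  hits <- which(abs(elapsed - 365) <= 90)
  if (length(hits) == 0L) NA_integer_ else hits[which.min(abs(elapsed[hits] - 365))]
})
per_test_rows <- sort(unique(c(1L, per_test[!is.na(per_test)])))

# Chained windows: each window starts from the previously selected test
chained_rows <- 1L
repeat {
  elapsed <- days - days[chained_rows[length(chained_rows)]]
  hits <- which(abs(elapsed - 365) <= 90)
  if (length(hits) == 0L) break
  chained_rows <- c(chained_rows, hits[which.min(abs(elapsed[hits] - 365))])
}

per_test_rows
chained_rows
```

Here the per-test rule keeps rows 1, 3 and 4, because row 4 is exactly 365 days after the unselected row 2, while the chained rule from the question keeps only rows 1 and 3.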