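I have two datasets: hospital admissions (admission) with admission dates, and lab results (test) with test dates. Each patient has an individual ID (patient_id), and each admission has its own admission ID (admission_id). The lab test dataset contains only the patient ID. Some reproducible sample data: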
admission <- data.frame(
patient_id = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
admission_id = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
start_date = as.Date(
c(
"2010-10-22",
"2013-04-30",
"2009-02-08",
"2015-12-12",
"2013-01-08",
"2015-02-27",
"2009-08-02",
"2011-12-19",
"2011-09-02",
"2016-05-25"
)
),
end_date = as.Date(
c(
"2010-10-23",
"2013-05-03",
"2009-02-12",
"2015-12-12",
"2013-01-15",
"2015-02-27",
"2009-08-06",
"2011-12-26",
"2011-09-06",
"2016-05-31"
)
)
)
test <- data.frame(
patient_id = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
test_date = as.Date(
c(
"2010-10-23",
"2013-04-01",
"2009-02-08",
"2015-12-12",
"2013-06-01",
"2015-02-28",
"2009-10-08",
"2011-12-21",
"2011-09-02",
"2016-05-26"
)
)
)
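The challenge is to assign admission_id to the test data as well, so that each test gets a truly unique identifier. My approach so far is to use dplyr::left_join on patient_id and then filter(test_date %within% interval(start_date, end_date)) from the lubridate package.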
library(dplyr)
data <- test %>% left_join(admission, by = "patient_id")
library(lubridate)
data %>% filter(test_date %within% interval(start_date, end_date))
Result:
patient_id test_date admission_id start_date end_date
1 a 2010-10-23 1 2010-10-22 2010-10-23
2 b 2009-02-08 1 2009-02-08 2009-02-12
3 b 2015-12-12 2 2015-12-12 2015-12-12
4 d 2011-12-21 2 2011-12-19 2011-12-26
5 e 2011-09-02 1 2011-09-02 2011-09-06
6 e 2016-05-26 2 2016-05-25 2016-05-31
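This works fine for this small example, but it becomes very slow for larger datasets (> 100,000 rows/observations).
Any ideas on how to speed this up with a different approach?
Answer 0 (score: 0)
Use foverlaps from data.table; it is a fast approach for large data objects: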
> # here is a solution using the 'foverlaps' function in 'data.table'
> library(data.table)
> admission <- data.frame(
+ patient_id = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
+ admission_id = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
+ .... [TRUNCATED]
> test <- data.frame(
+ patient_id = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
+ test_date = as.Date(
+ c(
+ "2010-10-23",
+ .... [TRUNCATED]
> # add dummy dates to test after making data.tables
> setDT(admission)
> setDT(test)
> test[, `:=`(start_date = test_date, end_date = test_date)]
> # key on patient_id as well, so tests are only matched to admissions of the same patient
> setkey(admission, patient_id, start_date, end_date) # set the key that is required
> foverlaps(test, admission)[
+ !is.na(admission_id)][, # remove non-matches
+ `:=`(i.start_date = NULL, i.end_date = NULL)] .... [TRUNCATED]
patient_id admission_id start_date end_date test_date
1: a 1 2010-10-22 2010-10-23 2010-10-23
2: b 1 2009-02-08 2009-02-12 2009-02-08
3: b 2 2015-12-12 2015-12-12 2015-12-12
4: d 2 2011-12-19 2011-12-26 2011-12-21
5: e 1 2011-09-02 2011-09-06 2011-09-02
6: e 2 2016-05-25 2016-05-31 2016-05-26
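As an alternative to foverlaps, the same matching can be sketched as a data.table non-equi join, which avoids adding the dummy start_date and end_date columns to test. This is only a sketch under a few assumptions: a data.table version recent enough to support non-equi joins in on and nomatch = NULL, and the result name res, which is made up for this illustration. The sample data is the same as in the question, rebuilt compactly so the block runs on its own.

```r
library(data.table)

# Same sample data as in the question, built directly as data.tables.
admission <- data.table(
  patient_id   = rep(c("a", "b", "c", "d", "e"), each = 2),
  admission_id = rep(c(1, 2), times = 5),
  start_date   = as.Date(c(
    "2010-10-22", "2013-04-30", "2009-02-08", "2015-12-12", "2013-01-08",
    "2015-02-27", "2009-08-02", "2011-12-19", "2011-09-02", "2016-05-25"
  )),
  end_date     = as.Date(c(
    "2010-10-23", "2013-05-03", "2009-02-12", "2015-12-12", "2013-01-15",
    "2015-02-27", "2009-08-06", "2011-12-26", "2011-09-06", "2016-05-31"
  ))
)
test <- data.table(
  patient_id = rep(c("a", "b", "c", "d", "e"), each = 2),
  test_date  = as.Date(c(
    "2010-10-23", "2013-04-01", "2009-02-08", "2015-12-12", "2013-06-01",
    "2015-02-28", "2009-10-08", "2011-12-21", "2011-09-02", "2016-05-26"
  ))
)

# Non-equi join: same patient, and test_date inside [start_date, end_date].
# nomatch = NULL drops admissions that have no test in their interval.
# In j, the x. prefix refers to columns of test and the i. prefix to
# columns of admission, so the original dates are kept side by side.
res <- test[
  admission,
  on = .(patient_id, test_date >= start_date, test_date <= end_date),
  nomatch = NULL,
  .(patient_id,
    test_date  = x.test_date,
    admission_id,
    start_date = i.start_date,
    end_date   = i.end_date)
]
print(res)
```

On the sample data this should return the same six matched tests as the foverlaps result. Whether it is faster than foverlaps on a given dataset is something to benchmark rather than assume.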