Question

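I have two datasets: hospital admissions (admission) with admission dates, and laboratory results (test) with test dates. Each patient has a personal ID (patient_id), and each admission has its own admission ID (admission_id). The lab test dataset contains only the patient ID. Some reproducible example data: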

admission <- data.frame(
  patient_id = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
  admission_id = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
  start_date = as.Date(
    c(
      "2010-10-22",
      "2013-04-30",
      "2009-02-08", 
      "2015-12-12",
      "2013-01-08", 
      "2015-02-27",
      "2009-08-02",
      "2011-12-19",
      "2011-09-02",
      "2016-05-25"
    )
    ),
  end_date = as.Date(
    c(
      "2010-10-23", 
      "2013-05-03",
      "2009-02-12",
      "2015-12-12",
      "2013-01-15",
      "2015-02-27",
      "2009-08-06",
      "2011-12-26",
      "2011-09-06",
      "2016-05-31"
    )
  )
  )

test <- data.frame(
  patient_id = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
  test_date = as.Date(
    c(
      "2010-10-23",
      "2013-04-01",
      "2009-02-08",
      "2015-12-12",
      "2013-06-01",
      "2015-02-28",
      "2009-10-08",
      "2011-12-21",
      "2011-09-02",
      "2016-05-26"
    )
  )
)

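The challenge is to also assign the admission_id to the test data, so that each test gets a truly unique identifier. My approach so far is to use dplyr::left_join on patient_id, then filter(test_date %within% interval(start_date, end_date)) from the lubridate package.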

library(dplyr)
data <- test %>% left_join(admission)

library(lubridate)
data %>% filter(test_date %within% interval(start_date, end_date))

Result:

  patient_id  test_date admission_id start_date   end_date
1          a 2010-10-23            1 2010-10-22 2010-10-23
2          b 2009-02-08            1 2009-02-08 2009-02-12
3          b 2015-12-12            2 2015-12-12 2015-12-12
4          d 2011-12-21            2 2011-12-19 2011-12-26
5          e 2011-09-02            1 2011-09-02 2011-09-06
6          e 2016-05-26            2 2016-05-25 2016-05-31

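This works fine for this small example, but it becomes very slow for larger datasets (> 100,000 rows/observations).

Any ideas on how to speed this up with a different approach?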

Answer 1

Use the foverlaps function from data.table, which is fast on large data objects:

> # here is a solution using the 'foverlaps' function in 'data.table'
> library(data.table)

> admission <- data.frame(
+   patient_id = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
+   admission_id = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
+ .... [TRUNCATED] 

> test <- data.frame(
+   patient_id = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
+   test_date = as.Date(
+     c(
+       "2010-10-23",
+  .... [TRUNCATED] 

> # add dummy dates to test after making data.tables
> setDT(admission)

> setDT(test)

> test[, `:=`(start_date = test_date, end_date = test_date)]

> setkey(admission, start_date, end_date)  # set the key that is required

> foverlaps(test, admission)[
+   !is.na(patient_id)][,  # remove non-matches
+     `:=`(i.patient_id = NULL, i.start_date = NULL, i.end_date = NULL)] .... [TRUNCATED] 
   patient_id admission_id start_date   end_date  test_date
1:          a            1 2010-10-22 2010-10-23 2010-10-23
2:          b            1 2009-02-08 2009-02-12 2009-02-08
3:          b            2 2015-12-12 2015-12-12 2015-12-12
4:          d            2 2011-12-19 2011-12-26 2011-12-21
5:          e            1 2011-09-02 2011-09-06 2011-09-02
6:          e            2 2016-05-25 2016-05-31 2016-05-26
>
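
One thing to check in the transcript above: admission is keyed on start_date and end_date only, so the overlap join appears to match on dates alone. That is fine for the example data, but on real data a test could be paired with a different patient's admission if the dates happen to overlap. A minimal sketch of the same foverlaps approach with patient_id added to the key follows. The small dataset here is made up for illustration (it is not the question's data), and res is just an illustrative name.

```r
library(data.table)

# Made-up illustration data: patient b's admission overlaps patient a's dates
admission <- data.table(
  patient_id   = c("a", "a", "b"),
  admission_id = c(1, 2, 1),
  start_date   = as.Date(c("2010-10-22", "2013-04-30", "2010-10-20")),
  end_date     = as.Date(c("2010-10-23", "2013-05-03", "2010-10-25"))
)
test <- data.table(
  patient_id = c("a", "a", "b"),
  test_date  = as.Date(c("2010-10-23", "2013-04-01", "2010-10-24"))
)

# Turn each test date into a zero-length interval so it can be overlap-joined
test[, `:=`(start_date = test_date, end_date = test_date)]

# Key on patient_id first; the last two key columns are the interval bounds
setkey(admission, patient_id, start_date, end_date)

# Equi-join on patient_id, overlap-join on the interval columns,
# then drop tests that fall outside every admission of that patient
res <- foverlaps(
  test, admission,
  by.x = c("patient_id", "start_date", "end_date")
)[!is.na(admission_id)]

# Remove the helper interval columns that came from test
res[, c("i.start_date", "i.end_date") := NULL]
print(res)
```

With patient_id in the key, the test for patient a on 2010-10-23 is matched only against patient a's admissions, even though patient b was also admitted on that date.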
