Question

希望尽快使此过程更快。我有两个非常大的数据框（下面提供了简短的示例）。

df1是一个较短的数据框，其中每一行代表一个病人的住院情况。 eid是每次hospitalisation，pid是患者id，doa是日期整数

library(tidyverse)
library(data.table)
library(purrr)

eid <- seq(1,4,1)
pid <- c(rep(111,2),rep(222,1),333)
doa <- as.numeric(c(1500,1100,600,200))
df1 <- as_tibble(cbind(eid,pid,doa))

df2通常是一个较长的数据框，其中每一行代表一个特定的药物处方。 pid是与df1中的相同pid匹配的患者ID。药物是处方药的类型。 dop是指定为整数的日期。

pid <- c(rep(111,2),rep(222,3))
drug <- c('a','a','b','c','a')
dop <- as.numeric(c(550,900,950,1000,500))
df2 <- as_tibble(cbind(pid,drug,dop))

实际上，我想要为df1附加的每种药物创建一列。我已经显示了药物“ a”的示例，如下所示：

df2 <- df2 %>% 
filter(drug=='a')

drug <- pmap(list(df1$pid,df1$doa),
function (x,y)
list(case_when(
#id match
df2$pid==x &  y-as.numeric(df2$dop) < 365 &y-as.numeric(df2$dop) > 0 ~1,
#id match and drug discharge <365 days
T ~ 0)
))

drug

dat <- data.table(matrix(unlist(drug),nrow=dim(df1)[1],byrow = T))

fun1 <- function (x) ifelse(x==1,T,F)

dat <- dat[,drug_a:=Reduce('|',lapply(.SD, fun1)), .SDcols = 1:3]

我想要的最终结果是一个看起来像

的数据框

df1 <- cbind(df1,dat[,'drug_a'])

但是对于drug_a，drug_b，drug_c等

df1有40万行，但df2有2亿行

是否有比我上面描述的过程更快，更有效的过程？

谢谢

Answer 1

如果我正确理解了您的问题，可以尝试一下。它不在data.table中，但是您基本上只是在尝试避免循环。没有原始数据很难对这类东西进行基准测试，因为不是每个功能的缩放比例都一样，但是我希望它比您目前的方法更快，更干净，并且可以根据需要扩展到尽可能多的药物。

library(tidyverse)
df1 <- tibble(
  eid = seq(1, 4, 1),
  pid = c(rep(111, 2), rep(222, 1), 333),
  doa = as.numeric(c(1500, 1100, 600, 200))
)
df2 <- tibble(
  pid = c(rep(111, 2), rep(222, 3)),
  drug = c("a", "a", "b", "c", "a"),
  dop = as.numeric(c(550, 900, 950, 1000, 500))
)

df1 %>%
  left_join(df2, by = "pid") %>% # one row per patient-hospitaliation-drug-prescription
  mutate(days_since_prescription = doa - dop) %>%
  group_by(eid, pid, drug) %>%
  summarise(within_365 = any(days_since_prescription < 365 & days_since_prescription > 0)) %>% 
  ungroup() %>% # now one row per patient-hospitalisation-drug
  spread(drug, within_365, sep = "_") %>% # now one row per patient-hospitalisation
  select(-drug_NA) %>% # cleanup tasks
  mutate_at(vars(starts_with("drug")), replace_na, replace = FALSE) # skip this if you want to preserve knowing whether a prescription took place
#> # A tibble: 4 x 5
#>     eid   pid drug_a drug_b drug_c
#>   <dbl> <dbl> <lgl>  <lgl>  <lgl> 
#> 1     1   111 FALSE  FALSE  FALSE 
#> 2     2   111 TRUE   FALSE  FALSE 
#> 3     3   222 TRUE   FALSE  FALSE 
#> 4     4   333 FALSE  FALSE  FALSE

由reprex package（v0.2.0）于2018-09-28创建。

用data.table替代pmap并减少？

1 个答案: