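R - Faster way to create conditional sums as new columns in a data frame

Time: 2018-02-22 18:03:23

Tags: r dplyr

My original question

I have a data frame of hospital visits that looks like this: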

df = data.frame(PNUM = c(1,1,1,1,2,2,2,2),
                indate=as.Date(c("2016-01-03","2016-05-05","2017-02-03",
                                 "2017-06-07","2016-01-03","2016-05-05",
                                 "2017-02-03","2017-06-07")),
                Inpatient=c(0,1,0,1,1,1,1,0),
                AnE=c(1,0,1,0,0,0,0,1))

Output:

  PNUM     indate Inpatient AnE
1    1 2016-01-03         0   1
2    1 2016-05-05         1   0
3    1 2017-02-03         0   1
4    1 2017-06-07         1   0
5    2 2016-01-03         1   0
6    2 2016-05-05         1   0
7    2 2017-02-03         1   0
8    2 2017-06-07         0   1

I now want to add columns that give the number of "Inpatient" and "AnE" visits in the 365 days before the current "indate". The desired result looks like this:

  PNUM     indate Inpatient AnE sum_365_Inpatient sum_365_AnE
1    1 2016-01-03         0   1                 0           0
2    1 2016-05-05         1   0                 0           1
3    1 2017-02-03         0   1                 1           0
4    1 2017-06-07         1   0                 0           1
5    2 2016-01-03         1   0                 0           0
6    2 2016-05-05         1   0                 1           0
7    2 2017-02-03         1   0                 1           0
8    2 2017-06-07         0   1                 1           0

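I found a way to do this (see below), but it is very slow (about 4 minutes for one new column over 10,000 rows). My original data frame has 2 million rows and more than 100 columns for which I want to create these sums. I am relatively new to R and built the following solution by piecing together answers to several similar questions, so I suspect it is not very efficient. I would appreciate any advice on how to improve my code.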

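Here is my very inefficient solution

I first define a function that computes the sum of a given column over the preceding X days (additionally restricted by ID, since I only want events from the same person):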

# Function definition

hist_sum = function(colname,ID,date_input,x) { 
    # window start and end
    window_start = date_input - x
    window_end = date_input
    # Calculate sum within window
    sum(df[(df$PNUM == ID) & (df$indate >= window_start) &
           (df$indate < window_end),c(colname)])
}


# Vectorise function

hist_sum = Vectorize(hist_sum)

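I then use a for loop and dplyr's mutate function to compute the new columns for "Inpatient" and "AnE", using PNUM as the ID, indate as the event date and a window of 365 days (and creating a unique column name for each window):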

library(dplyr)

for (i in c("Inpatient","AnE")) {
    # Generate column title
    coltitle = paste("sum", "365", i, sep = "_")
    # Apply 
    df = mutate(df, !!coltitle := hist_sum(i,PNUM,indate,365))
}

1 answer:

Answer 0 (score: 4)

Non-equi joins are designed for exactly this purpose.
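As a rough illustration of what a non-equi join does, here is a minimal sketch on made-up data. The tables `events` and `windows` and their columns are hypothetical and not part of the question; the `x.` and `i.` prefixes are used so that it is explicit which table each output column comes from.

```r
library(data.table)

# Hypothetical toy data (not from the question): events per id on integer days
events  = data.table(id = c(1L, 1L, 2L), day = c(5L, 20L, 7L))
# Hypothetical half-open windows [lo, hi) to look events up in
windows = data.table(id = c(1L, 1L, 2L),
                     lo = c(0L, 10L, 10L),
                     hi = c(30L, 30L, 30L))

# For each row of `windows`, find the events with the same id and lo <= day < hi.
# nomatch = 0L drops windows that contain no event.
matched = events[windows,
                 on = .(id, day >= lo, day < hi),
                 nomatch = 0L,
                 .(id = x.id, lo = i.lo, hi = i.hi, day = x.day)]
print(matched)
```

The join condition mixes an equality (`id`) with two inequalities on `day`, which is what distinguishes it from an ordinary equi-join.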

For the case of mutually exclusive dummy columns...

First, some setup...

# go to long form

library(data.table)
DT = melt(setDT(df), id=c("PNUM", "indate"), variable.name = "status")[value == 1, !"value"]
setorder(DT, PNUM, indate)

# use integer dates

DT[, indate := as.IDate(indate)]


   PNUM     indate    status
1:    1 2016-01-03       AnE
2:    1 2016-05-05 Inpatient
3:    1 2017-02-03       AnE
4:    1 2017-06-07 Inpatient
5:    2 2016-01-03 Inpatient
6:    2 2016-05-05 Inpatient
7:    2 2017-02-03 Inpatient
8:    2 2017-06-07       AnE

Now, counting them:

for (s in unique(DT$status)){
  DT[, paste0("n365_", s) := 
    .SD[status == s][.SD[, .(PNUM, d_dn = indate - 365L, d_up = indate)], 
      on=.(PNUM, indate >= d_dn, indate < d_up),
      .N, by=.EACHI]$N
 ][]
}

   PNUM     indate    status n365_AnE n365_Inpatient
1:    1 2016-01-03       AnE        0              0
2:    1 2016-05-05 Inpatient        1              0
3:    1 2017-02-03       AnE        0              1
4:    1 2017-06-07 Inpatient        1              0
5:    2 2016-01-03 Inpatient        0              0
6:    2 2016-05-05 Inpatient        0              1
7:    2 2017-02-03 Inpatient        0              1
8:    2 2017-06-07       AnE        0              1

How it works. Written out more verbosely:

for (s in unique(DT$status)){
  DT[, paste0("n365_", s) := {

    # define the ranges we are interested in
    look_these_up = .SD[, .(PNUM, d_dn = indate - 365L, d_up = indate)]

    # define where we are looking
    look_in_here = .SD[status == s]

    # do the lookup
    # counting rows of look_in_here (.N)
    look_in_here[look_these_up, on=.(PNUM, indate >= d_dn, indate < d_up),
      .N, by=.EACHI]$N
 }][]
}
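To see the counting step of the lookup in isolation, the following sketch runs the same `.N, by=.EACHI` pattern on made-up data. The tables `events` and `windows` are hypothetical; the point is only that `by=.EACHI` yields one count per row of the lookup table, including a zero when nothing matches.

```r
library(data.table)

# Hypothetical toy data (not from the question)
events  = data.table(id = c(1L, 1L, 2L), day = c(5L, 20L, 7L))
windows = data.table(id = c(1L, 1L, 2L),
                     lo = c(0L, 10L, 10L),
                     hi = c(30L, 30L, 30L))

# One result row per row of `windows`; .N counts the matching rows of `events`
res = events[windows, on = .(id, day >= lo, day < hi), .N, by = .EACHI]
counts = res$N
print(counts)  # 2 1 0: the third window contains no event
```

Because the counts come back in the row order of the lookup table, they can be assigned directly as a new column, which is what the `:=` in the loop relies on.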

The syntax for data.table joins is x[i, on=, j]: we look up each row of i in x using the on= rules and then evaluate j. For details, see ?data.table.

For the case of possibly overlapping dummy columns...

The OP raised this possibility in a comment. In this case, we cannot go to long form and collapse to a single "status" column.

Generally, joins/lookups are faster for integers than for floats, which is why the conversions are done here.

library(data.table)
DT = data.table(df)
mycols = setdiff(names(DT), c("PNUM", "indate"))

# use integer dates
DT[, indate := as.IDate(indate)]
# use integer dummies
DT[, (mycols) := lapply(.SD, as.integer), .SDcols=mycols]

DT[, paste0("n365_", mycols) := {
    # define the ranges we are interested in
    look_these_up = DT[, .(PNUM, d_dn = indate - 365L, d_up = indate)]

    lapply(mycols, function(s){
        # define where we are looking
        look_in_here = .SD[get(s) == 1L]

        # do the lookup, counting rows of look_in_here (.N)
        look_in_here[look_these_up, on=.(PNUM, indate >= d_dn, indate < d_up),
          .N, by=.EACHI]$N
    })
}][]

The lapply and for loop approaches are equivalent, but the lapply approach builds look_these_up only once, so it might be faster.
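As a small check of the integer conversions mentioned above, this sketch compares the storage types before and after converting. The variable names are illustrative only.

```r
library(data.table)

d = as.Date("2016-01-03")
date_type  = typeof(d)             # "double": base Date is stored as a double
idate_type = typeof(as.IDate(d))   # "integer": IDate is stored as an integer

dummy = c(0, 1)
dummy_type     = typeof(dummy)               # "double"
int_dummy_type = typeof(as.integer(dummy))   # "integer"

print(c(date_type, idate_type, dummy_type, int_dummy_type))
```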