My original question
I have a data frame of hospital visits that looks like this:
df = data.frame(PNUM = c(1,1,1,1,2,2,2,2),
indate=as.Date(c("2016-01-03","2016-05-05","2017-02-03",
"2017-06-07","2016-01-03","2016-05-05",
"2017-02-03","2017-06-07")),
Inpatient=c(0,1,0,1,1,1,1,0),
AnE=c(1,0,1,0,0,0,0,1))
Output:
PNUM indate Inpatient AnE
1 1 2016-01-03 0 1
2 1 2016-05-05 1 0
3 1 2017-02-03 0 1
4 1 2017-06-07 1 0
5 2 2016-01-03 1 0
6 2 2016-05-05 1 0
7 2 2017-02-03 1 0
8 2 2017-06-07 0 1
I now want to add columns that count the "Inpatient" and "AnE" visits in the 365 days before the current "indate". The desired result looks like this:
PNUM indate Inpatient AnE sum_365_Inpatient sum_365_AnE
1 1 2016-01-03 0 1 0 0
2 1 2016-05-05 1 0 0 1
3 1 2017-02-03 0 1 1 0
4 1 2017-06-07 1 0 0 1
5 2 2016-01-03 1 0 0 0
6 2 2016-05-05 1 0 1 0
7 2 2017-02-03 1 0 1 0
8 2 2017-06-07 0 1 1 0
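I found a way to do this (see below), but it is very slow: creating one new column for 10,000 rows takes about 4 minutes. My original data frame has 2 million rows and more than 100 columns for which I want to create these sums. I am fairly new to R and built the solution below by piecing together bits from several similar questions, so I assume it is not very efficient. I would appreciate any suggestions on how to improve my code.
Here is my very inefficient solution
I first define a function that sums a given column over a look-back window of x days (additionally restricted by ID, because I only want events from the same person):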
# Function definition
hist_sum = function(colname,ID,date_input,x) {
# window start and end
window_start = date_input - x
window_end = date_input
# Calculate sum within window
sum(df[(df$PNUM == ID) & (df$indate >= window_start) &
(df$indate < window_end),c(colname)])
}
# Vectorise function
hist_sum = Vectorize(hist_sum)
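I then use a for loop and dplyr's mutate function to compute the sums for the "Inpatient" and "AnE" columns, using PNUM as the ID, indate as the event date, and a 365-day window (creating a unique column name for each one):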
library(dplyr)
for (i in c("Inpatient","AnE")) {
# Generate column title
# the window width (365) matches the hist_sum call below
coltitle = paste("sum", 365, i, sep="_")
# Apply
df = mutate(df, !!coltitle := hist_sum(i,PNUM,indate,365))
}
Answer 0: (score: 4)
Non-equi joins are designed for this.
For the case of mutually exclusive dummy columns...
First, some setup...
# go to long form
library(data.table)
DT = melt(setDT(df), id=c("PNUM", "indate"), variable.name = "status")[value == 1, !"value"]
setorder(DT, PNUM, indate)
# use integer dates
DT[, indate := as.IDate(indate)]
PNUM indate status
1: 1 2016-01-03 AnE
2: 1 2016-05-05 Inpatient
3: 1 2017-02-03 AnE
4: 1 2017-06-07 Inpatient
5: 2 2016-01-03 Inpatient
6: 2 2016-05-05 Inpatient
7: 2 2017-02-03 Inpatient
8: 2 2017-06-07 AnE
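As a small aside on the "use integer dates" step in the setup above, the following sketch shows what the `as.IDate` conversion changes. The variable names `d_date` and `d_idate` are made up for illustration; the point is only that a base R `Date` built from a string is stored as a double, while a data.table `IDate` is stored as an integer, and subtracting an integer number of days keeps it an `IDate`.

```r
library(data.table)

d_date  <- as.Date("2016-01-03")   # base R Date
d_idate <- as.IDate("2016-01-03")  # data.table integer date

typeof(d_date)                     # "double"
typeof(d_idate)                    # "integer"

# subtracting an integer day count keeps the IDate class
window_start <- d_idate - 365L
class(window_start)
```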
Now count them:
for (s in unique(DT$status)){
DT[, paste0("n365_", s) :=
.SD[status == s][.SD[, .(PNUM, d_dn = indate - 365L, d_up = indate)],
on=.(PNUM, indate >= d_dn, indate < d_up),
.N, by=.EACHI]$N
][]
}
PNUM indate status n365_AnE n365_Inpatient
1: 1 2016-01-03 AnE 0 0
2: 1 2016-05-05 Inpatient 1 0
3: 1 2017-02-03 AnE 0 1
4: 1 2017-06-07 Inpatient 1 0
5: 2 2016-01-03 Inpatient 0 0
6: 2 2016-05-05 Inpatient 0 1
7: 2 2017-02-03 Inpatient 0 1
8: 2 2017-06-07 AnE 0 1
How it works, written out more verbosely:
for (s in unique(DT$status)){
DT[, paste0("n365_", s) := {
# define the ranges we are interested in
look_these_up = .SD[, .(PNUM, d_dn = indate - 365L, d_up = indate)]
# define where we are looking
look_in_here = .SD[status == s]
# do the lookup
# counting rows of look_in_here (.N)
look_in_here[look_these_up, on=.(PNUM, indate >= d_dn, indate < d_up),
.N, by=.EACHI]$N
}][]
}
The syntax for data.table joins is x[i, on=, j]: we use the on= rules to look up each row of i in x, and then do j. See ?data.table for details.
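To see the join form in isolation, here is a minimal sketch on toy data. The tables `x` and `i` and their columns (`id`, `day`, `lo`, `hi`) are invented for illustration and are not part of the question's data. Each row of `i` defines a half-open range [lo, hi), and the join counts the rows of `x` that fall inside it for the same `id`.

```r
library(data.table)

# x: the table we look in (toy events)
x <- data.table(id = c(1L, 1L, 2L), day = c(1L, 5L, 3L))

# i: one lookup range per row, covering lo <= day < hi
i <- data.table(id = c(1L, 1L, 2L),
                lo = c(0L, 2L, 4L),
                hi = c(6L, 6L, 9L))

# for each row of i (by = .EACHI), count the matching rows of x (.N)
res <- x[i, on = .(id, day >= lo, day < hi), .N, by = .EACHI]
counts <- res$N   # 2, 1, 0
```

The last range matches nothing, and .N is 0 for it rather than NA, which is what lets the loops in this answer assign the counts straight into a new column.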
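For the case of possibly overlapping dummy columns...
The OP raised this possibility in a comment. In that case we cannot melt to long form and collapse everything into a single "status" column.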
Joins/lookups are generally faster on integers than on floating-point numbers, which is why the conversions are done here.
library(data.table)
DT = data.table(df)
mycols = setdiff(names(DT), c("PNUM", "indate"))
# use integer dates
DT[, indate := as.IDate(indate)]
# use integer dummies
DT[, (mycols) := lapply(.SD, as.integer), .SDcols=mycols]
DT[, paste0("n365_", mycols) := {
# define the ranges we are interested in
look_these_up = DT[, .(PNUM, d_dn = indate - 365L, d_up = indate)]
lapply(mycols, function(s){
# define where we are looking
look_in_here = .SD[get(s) == 1L]
# do the lookup, counting rows of look_in_here (.N)
look_in_here[look_these_up, on=.(PNUM, indate >= d_dn, indate < d_up),
.N, by=.EACHI]$N
})
}][]
The lapply and for loop approaches are equivalent, but the lapply approach only involves building look_these_up once, so it may be faster.
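As a rough end-to-end check, the wide-format approach can be wrapped in a function and run against the sample data from the question. This is only a sketch: `add_window_counts` and its `cols` and `days` arguments are hypothetical names introduced here, and it assumes the ID and date columns are called PNUM and indate and that the dummy columns hold only 0/1 values, so counting matching rows equals summing the dummy.

```r
library(data.table)

df <- data.frame(
  PNUM = c(1, 1, 1, 1, 2, 2, 2, 2),
  indate = as.Date(c("2016-01-03", "2016-05-05", "2017-02-03", "2017-06-07",
                     "2016-01-03", "2016-05-05", "2017-02-03", "2017-06-07")),
  Inpatient = c(0, 1, 0, 1, 1, 1, 1, 0),
  AnE = c(1, 0, 1, 0, 0, 0, 0, 1)
)

# hypothetical helper: adds one n<days>_<col> count column per dummy column
add_window_counts <- function(df, cols, days = 365L) {
  days <- as.integer(days)
  DT <- as.data.table(df)              # works on a copy of df
  DT[, indate := as.IDate(indate)]     # integer dates
  # the look-back ranges are built once and reused for every column
  ranges <- DT[, .(PNUM, d_dn = indate - days, d_up = indate)]
  for (s in cols) {
    hits <- DT[DT[[s]] == 1]           # rows where this dummy is set
    newcol <- paste0("n", days, "_", s)
    counts <- hits[ranges, on = .(PNUM, indate >= d_dn, indate < d_up),
                   .N, by = .EACHI]$N
    DT[, (newcol) := counts]
  }
  DT[]
}

res <- add_window_counts(df, c("Inpatient", "AnE"))
```

On the sample data, `res$n365_Inpatient` and `res$n365_AnE` should reproduce the sum_365_Inpatient and sum_365_AnE columns of the desired result in the question.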