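I asked a similar question earlier and got a lot of help: R: Aggregating History By ID By Date
The difference is that in the previous post I wanted to aggregate all of the historical information, whereas now I want to look back only 90 days.
Here is an example of what my data looks like: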
strDates <- c("09/09/16", "5/7/16", "5/6/16", "2/13/16", "2/11/16","1/7/16",
"11/8/16","6/8/16", "5/8/16","2/13/16","1/3/16", "1/1/16")
Date<-as.Date(strDates, "%m/%d/%y")
ID <- c("A", "A", "A", "A","A", "A", "B","B","B","B","B", "B")
Event <- c(1,0,1,0,1,1, 0,1,1,1,0, 1)
sample_df <- data.frame(Date,ID,Event)
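And the output:
Background
I want to keep all of the information attached to each encounter, and then aggregate the following historical information by ID, going back 90 days.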
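Answer 0 (score: 9)
Here is a very efficient alternative using data.table. It relies on the non-equi joins added to data.table in v1.9.8 (the code below was run on v1.10.0), combined with by = .EACHI, which lets you run the calculations while joining: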
library(data.table) #v1.10.0
setDT(sample_df)[, Date2 := Date - 90] # Set range (Maybe in future this could be avoided)
sample_df[sample_df, # Binary join with itself
.(Enc90D = .N, Ev90D = sum(Event, na.rm = TRUE)), # Make calculations
on = .(ID = ID, Date < Date, Date > Date2), # Join by
by = .EACHI] # Do calculations per each match
# ID Date Date Enc90D Ev90D
# 1: A 2016-09-09 2016-06-11 0 0
# 2: A 2016-05-07 2016-02-07 3 2
# 3: A 2016-05-06 2016-02-06 2 1
# 4: A 2016-02-13 2015-11-15 2 2
# 5: A 2016-02-11 2015-11-13 1 1
# 6: A 2016-01-07 2015-10-09 0 0
# 7: B 2016-11-08 2016-08-10 0 0
# 8: B 2016-06-08 2016-03-10 1 1
# 9: B 2016-05-08 2016-02-08 1 1
# 10: B 2016-02-13 2015-11-15 2 1
# 11: B 2016-01-03 2015-10-05 1 1
# 12: B 2016-01-01 2015-10-03 0 0
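To see what by = .EACHI saves, here is a minimal base R sketch of the same calculation done the long way: build every within-ID pair with merge, filter on the date window, then aggregate. The names all_pairs, hits and hit_key are made up for this illustration, and this is not a description of how data.table implements the join internally.

```r
# Rebuild the question's sample data so the block runs on its own
strDates <- c("09/09/16", "5/7/16", "5/6/16", "2/13/16", "2/11/16", "1/7/16",
              "11/8/16", "6/8/16", "5/8/16", "2/13/16", "1/3/16", "1/1/16")
sample_df <- data.frame(Date  = as.Date(strDates, "%m/%d/%y"),
                        ID    = rep(c("A", "B"), each = 6),
                        Event = c(1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1))

# Step 1: materialise every (current row, earlier-candidate row) pair per ID
all_pairs <- merge(sample_df, sample_df, by = "ID",
                   suffixes = c(".cur", ".prev"))

# Step 2: keep pairs where the candidate is 1 to 89 days before the current row
# (same as the join conditions Date < Date and Date > Date2)
gap  <- as.numeric(all_pairs$Date.cur - all_pairs$Date.prev)
hits <- all_pairs[gap > 0 & gap < 90, ]

# Step 3: aggregate per current row; assumes (ID, Date) is unique, as in the sample.
# Using a factor with all rows as levels keeps rows that have no matches.
hit_key <- factor(paste(hits$ID, hits$Date.cur),
                  levels = paste(sample_df$ID, sample_df$Date))
by_row  <- split(hits$Event.prev, hit_key)
sample_df$Enc90D <- unname(lengths(by_row))
sample_df$Ev90D  <- unname(vapply(by_row, sum, numeric(1)))
sample_df
```

The intermediate all_pairs table has 72 rows for only 12 encounters, and it grows quadratically with the number of rows per ID. Computing the aggregates while joining avoids building it.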
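Answer 1 (score: 2)
A partially vectorized dplyr solution, in which you combine do (to loop over the groups) with rowwise operations. This way Date refers to the date of the current row, while .$Date refers to the whole Date column of the current group: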
library(dplyr)
sample_df %>%
group_by(ID) %>%
do(rowwise(.) %>%
mutate(PrevEnc90D = sum(Date - .$Date < 90 & Date - .$Date > 0),
PrevEvent90D = sum(.$Event[Date - .$Date < 90 & Date - .$Date > 0])))
#Source: local data frame [12 x 5]
#Groups: ID [2]
# Date ID Event PrevEnc90D PrevEvent90D
# <date> <fctr> <dbl> <int> <dbl>
#1 2016-09-09 A 1 0 0
#2 2016-05-07 A 0 3 2
#3 2016-05-06 A 1 2 1
#4 2016-02-13 A 0 2 2
#5 2016-02-11 A 1 1 1
#6 2016-01-07 A 1 0 0
#7 2016-11-08 B 0 0 0
#8 2016-06-08 B 1 1 1
#9 2016-05-08 B 1 1 1
#10 2016-02-13 B 1 2 1
#11 2016-01-03 B 0 1 1
#12 2016-01-01 B 1 0 0
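Answer 2 (score: 2)
A rather verbose dplyr solution that uses more rows than are really needed. The idea is to build a fully joined table with one row per ID for every date, and then use window functions. This can be useful if you need different window calculations.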
library(dplyr)
library(zoo) # provides rollsum()
# One row per calendar day, starting 90 days before the first encounter
dates <- data.frame(Date = seq(from = -90 + min(sample_df$Date), to = max(sample_df$Date), by=1))
extended_df <- data.frame(ID = unique(sample_df$ID)) %>%
  merge(dates) %>% # cross join: every ID with every day
  left_join(sample_df, by=(c("ID", "Date"))) %>%
  arrange(ID, desc(Date)) %>%
  mutate(Encounter = as.integer(!is.na(Event)),
         Event = ifelse(is.na(Event), 0, Event)) %>%
  group_by(ID) %>%
  mutate(PrevEnc90D = rollsum(lead(Encounter), k=90, fill=0, align="left"),
         PrevEvent90D = rollsum(lead(Event), k=90, fill=0, align="left")) %>%
  inner_join(sample_df[,c("ID", "Date")]) %>% # keep only the original encounter dates
  arrange(ID, desc(Date))
extended_df
#Source: local data frame [12 x 6]
#Groups: ID [2]
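The daily-grid idea can also be sketched in base R for a single ID. This is a simplified illustration rather than the answer's method: it swaps rollsum for a difference of cumulative sums, and the names grid, pos, cum_enc and cum_ev are made up here. It counts the days 1 to 89 before each encounter, matching the strict < 90 used in the other answers; the k = 90 window above also includes the day exactly 90 days earlier, which makes no difference on the sample data.

```r
# Encounters for ID "A" from the sample data
enc_dates <- as.Date(c("2016-09-09", "2016-05-07", "2016-05-06",
                       "2016-02-13", "2016-02-11", "2016-01-07"))
events    <- c(1, 0, 1, 0, 1, 1)

# One slot per calendar day, padded by 90 days so every lookup stays in range
grid <- seq(min(enc_dates) - 90, max(enc_dates), by = "day")
pos  <- match(enc_dates, grid)          # grid index of each encounter

# Daily indicator vectors: 1 on encounter days, the Event value on those days
enc_daily <- numeric(length(grid)); enc_daily[pos] <- 1
ev_daily  <- numeric(length(grid)); ev_daily[pos]  <- events

# Running totals, computed once
cum_enc <- cumsum(enc_daily)
cum_ev  <- cumsum(ev_daily)

# Total over days d-89 .. d-1 is cum[d-1] - cum[d-90]
PrevEnc90D   <- cum_enc[pos - 1] - cum_enc[pos - 90]
PrevEvent90D <- cum_ev[pos - 1]  - cum_ev[pos - 90]
data.frame(Date = enc_dates, Event = events, PrevEnc90D, PrevEvent90D)
```

Changing the window only means changing the two offsets, which is the flexibility the daily grid buys at the cost of the extra rows.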
Answer 3 (score: 1)
Another idea, which avoids repeated summing and relational operations as far as possible:
do.call(rbind,
lapply(split(sample_df, sample_df$ID),
function(x) {
i = nrow(x) - findInterval(x$Date - 90, rev(x$Date))
cs = cumsum(x$Event)
cbind(x, PrevEnc90D = i - (1:nrow(x)), PrevEvent90D = cs[i] - cs)
}))
# Date ID Event PrevEnc90D PrevEvent90D
#A.1 2016-09-09 A 1 0 0
#A.2 2016-05-07 A 0 3 2
#A.3 2016-05-06 A 1 2 1
#A.4 2016-02-13 A 0 2 2
#A.5 2016-02-11 A 1 1 1
#A.6 2016-01-07 A 1 0 0
#B.7 2016-11-08 B 0 0 0
#B.8 2016-06-08 B 1 1 1
#B.9 2016-05-08 B 1 1 1
#B.10 2016-02-13 B 1 2 1
#B.11 2016-01-03 B 0 1 1
#B.12 2016-01-01 B 1 0 0
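The above assumes that "Date" is sorted in decreasing order within each "ID" (if it is not, sorting it first is trivial). The main idea is to (i) find, for each date, the position of the date 90 days earlier, (ii) compute the cumulative sum of events once, and (iii) subtract the corresponding indices / cumsum values to get the output. I used the split / lapply route here to group by "ID", but I guess it is easily transferable to any tool.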
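The three steps can be traced on the rows for ID "A". This is a small walk-through sketch: the rows are deliberately shuffled to show the sorting step, and the variable name older is introduced only for this illustration.

```r
# ID "A" rows from the sample data, in shuffled order
x <- data.frame(
  Date  = as.Date(c("2016-02-13", "2016-09-09", "2016-01-07",
                    "2016-05-06", "2016-02-11", "2016-05-07")),
  Event = c(0, 1, 1, 1, 1, 0)
)

# Precondition: newest first within the ID
x <- x[order(x$Date, decreasing = TRUE), ]
n <- nrow(x)

# (i) For each row, how many encounters are 90 or more days older?
#     findInterval needs an increasing vector, hence rev()
older <- findInterval(x$Date - 90, rev(x$Date))   # 5 1 1 0 0 0
i     <- n - older                                # last row inside the window: 1 5 5 6 6 6

# (ii) Cumulative events, computed once, from newest to oldest
cs <- cumsum(x$Event)                             # 1 1 2 2 3 4

# (iii) Rows between the current one and i are the previous 90 days
x$PrevEnc90D   <- i - seq_len(n)
x$PrevEvent90D <- cs[i] - cs
x
```

Because findInterval counts dates on or before Date - 90, an encounter exactly 90 days earlier falls outside the window, which agrees with the strict < 90 condition used in the other answers.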