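I am trying to count the number of positive events in a rolling 12-month period.
I can create the missing rows so that there are 365 rows per year and then use zoo::rollapply
to sum the events over each 365-row window, but my data frame is very large and I want to do this for many variables, so it takes forever to run.
I can get the correct output with the following: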
data <- data.frame(id = c("a","a","a","a","a","b","b","b","b","b"),
                   date = c("20-01-2011","20-04-2011","20-10-2011","20-02-2012",
                            "20-05-2012","20-01-2013","20-04-2013","20-10-2013",
                            "20-02-2014","20-05-2014"),
                   event = c(0,1,1,1,0,1,0,0,1,1))
library(lubridate)
library(dplyr)
library(tidyr)
library(zoo)
data %>%
  group_by(id) %>%
  mutate(date = dmy(date),
         cumsum = cumsum(event)) %>%
  complete(date = full_seq(date, period = 1), fill = list(event = 0)) %>%
  mutate(event12 = rollapplyr(event, width = 365, FUN = sum, partial = TRUE)) %>%
  drop_na(cumsum)
Which gives:
id    date       event cumsum event12
<fct> <date>     <dbl>  <dbl>   <dbl>
a     2011-01-20     0      0       0
a     2011-04-20     1      1       1
a     2011-10-20     1      2       2
a     2012-02-20     1      3       3
a     2012-05-20     0      3       2
b     2013-01-20     1      1       1
b     2013-04-20     0      1       1
b     2013-10-20     0      1       1
b     2014-02-20     1      2       1
b     2014-05-20     1      3       2
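But I would like to know whether there is a more efficient way to do this, for example how I could make the width in rollapply
count dates rather than rows.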
Answer 0 (score: 0)
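After converting the dates to Date class, a self-join with a date-range condition does this
in a single SQL statement, with no need to fill in the missing dates: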
library(sqldf)
data2 <- transform(data, date = as.Date(date, "%d-%m-%Y"))
sqldf("select a.*, sum(b.event) as event12
       from data2 as a
       left join data2 as b
         on a.id = b.id and b.date between a.date - 365 and a.date
       group by a.rowid
       order by a.rowid")
giving:
   id       date event event12
1   a 2011-01-20     0       0
2   a 2011-04-20     1       1
3   a 2011-10-20     1       2
4   a 2012-02-20     1       3
5   a 2012-05-20     0       2
6   b 2013-01-20     1       1
7   b 2013-04-20     0       1
8   b 2013-10-20     0       1
9   b 2014-02-20     1       1
10  b 2014-05-20     1       2
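For readers who would rather not go through SQL, the same date-range idea can be sketched in base R. This is only an illustration, not part of the answer above: the helper name roll_sum_by_date and its days argument are made up for this example. The window is inclusive at both ends, mirroring the `between a.date - 365 and a.date` condition in the query.

```r
# Sample data from the question, with the dates parsed to Date class
data <- data.frame(
  id = c("a","a","a","a","a","b","b","b","b","b"),
  date = as.Date(c("20-01-2011","20-04-2011","20-10-2011","20-02-2012",
                   "20-05-2012","20-01-2013","20-04-2013","20-10-2013",
                   "20-02-2014","20-05-2014"), format = "%d-%m-%Y"),
  event = c(0,1,1,1,0,1,0,0,1,1)
)

# Hypothetical helper (not from any package): for each row, sum `value`
# over the rows whose date lies in [date - days, date], both ends inclusive.
roll_sum_by_date <- function(date, value, days = 365) {
  vapply(seq_along(date), function(i) {
    in_window <- date >= date[i] - days & date <= date[i]
    sum(value[in_window])
  }, numeric(1))
}

# Apply the helper separately to the rows of each id, writing the results
# back in the original row order.
data$event12 <- NA_real_
for (rows in split(seq_len(nrow(data)), data$id)) {
  data$event12[rows] <- roll_sum_by_date(data$date[rows], data$event[rows])
}

print(data)
```

On the sample data this yields the same event12 column as shown above. Each row is compared against every other row of the same id, so the cost grows quadratically with group size; for very large groups the SQL join or an indexed approach is likely the better fit.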