Calculating statistics over a dynamic window using dplyr

Time: 2017-03-22 19:09:37

Tags: r dplyr

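I am trying to use dplyr in R to calculate rolling statistics (mean, sd, etc.) over a dynamic window that is based on dates, within specific groups. For example, within each group of items, I want to calculate the rolling mean over all data from the preceding 10 days. The dates in the data are neither consecutive nor complete, so I cannot use a fixed window.

One way to do this is to pass rollapply a vector of window widths, as shown below. However, I am having trouble calculating the dynamic widths. I would prefer an approach that skips the intermediate step of calculating the window and simply computes the statistics from date_lookback. Here is a toy example.

I have already done this with for loops, but they are very slow.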

library(dplyr)
library(zoo)

date_lookback <- 10 #days to look back for rolling calcs

df <- data.frame(label = c(rep("a",5),rep("b",5)),
                 date = as.Date(c("2017-01-02","2017-01-20",
                                  "2017-01-21","2017-01-30","2017-01-31","2017-01-05",
                                  "2017-01-08","2017-01-09","2017-01-10","2017-01-11")),
                data = c(790,493,718,483,825,186,599,408,108,666),stringsAsFactors = FALSE) %>%
  mutate(.,
         cut_date = date - date_lookback, #calcs based on sample since this date
         dyn_win = c(1,1,2,3,3,1,2,3,4,5), ##!! need to calculate this vector??
         roll_mean = rollapply(data, align = "right", width = dyn_win, mean),
         roll_sd = rollapply(data, align = "right", width = dyn_win, sd))

These are the roll_mean and roll_sd results I am looking for:

> df
   label       date data   cut_date dyn_win roll_mean  roll_sd
1      a 2017-01-02  790 2016-12-23       1  790.0000       NA
2      a 2017-01-20  493 2017-01-10       1  493.0000       NA
3      a 2017-01-21  718 2017-01-11       2  605.5000 159.0990
4      a 2017-01-30  483 2017-01-20       3  564.6667 132.8847
5      a 2017-01-31  825 2017-01-21       3  675.3333 174.9467
6      b 2017-01-05  186 2016-12-26       1  186.0000       NA
7      b 2017-01-08  599 2016-12-29       2  392.5000 292.0351
8      b 2017-01-09  408 2016-12-30       3  397.6667 206.6938
9      b 2017-01-10  108 2016-12-31       4  325.2500 222.3921
10     b 2017-01-11  666 2017-01-01       5  393.4000 245.5928

Thanks in advance.

1 Answer:

Answer 0: (score: 0)

You can try referencing the dataset explicitly inside the dplyr call:

library(dplyr)

date_lookback <- 10 # days to look back for rolling calcs

df <- data.frame(label = c(rep("a",5),rep("b",5)),
                 date = as.Date(c("2017-01-02","2017-01-20",
                                  "2017-01-21","2017-01-30","2017-01-31","2017-01-05",
                                  "2017-01-08","2017-01-09","2017-01-10","2017-01-11")),
                 data = c(790,493,718,483,825,186,599,408,108,666),stringsAsFactors = FALSE)

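# Group by date and label so that `date` and `label` inside mutate() refer to
# the current group's values (a single row here), while df$... refers to the
# full data set.
df %>%
  group_by(date, label) %>%
  mutate(
    roll_mean = mean(ifelse(df$date >= date - date_lookback & df$date <= date & df$label == label,
                            df$data, NA), na.rm = TRUE),
    roll_sd = sd(ifelse(df$date >= date - date_lookback & df$date <= date & df$label == label,
                        df$data, NA), na.rm = TRUE)) %>%
  ungroup()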
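
The approach above relies on every date/label pair being unique, because `date` and `label` must each be a single value inside each group. As a hedged alternative, here is a minimal sketch that groups by label only and computes the window per row. `window_stat` is a hypothetical helper written for this example, not a dplyr or zoo function. It also recovers the dyn_win vector from the question as a by-product.

```r
library(dplyr)

date_lookback <- 10  # days to look back, as in the question

df <- data.frame(
  label = c(rep("a", 5), rep("b", 5)),
  date = as.Date(c("2017-01-02", "2017-01-20", "2017-01-21", "2017-01-30",
                   "2017-01-31", "2017-01-05", "2017-01-08", "2017-01-09",
                   "2017-01-10", "2017-01-11")),
  data = c(790, 493, 718, 483, 825, 186, 599, 408, 108, 666),
  stringsAsFactors = FALSE
)

# Hypothetical helper (an assumption of this sketch): for each row, apply `fun`
# to the values whose dates fall in the closed interval [date - lookback, date].
window_stat <- function(dates, values, lookback, fun) {
  vapply(seq_along(dates), function(i) {
    in_window <- dates >= dates[i] - lookback & dates <= dates[i]
    fun(values[in_window])
  }, numeric(1))
}

result <- df %>%
  group_by(label) %>%  # windows never cross labels
  mutate(
    dyn_win   = window_stat(date, data, date_lookback, length),
    roll_mean = window_stat(date, data, date_lookback, mean),
    roll_sd   = window_stat(date, data, date_lookback, sd)
  ) %>%
  ungroup()

print(result)
```

On the toy data this should reproduce the dyn_win and roll_mean columns shown in the question. Each row is compared against every other row in its group, so the cost grows quadratically with group size. That is probably fine for moderate groups, but for very large ones a non-equi join (for example in data.table) may be worth considering.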