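I have a 10M-row dataset with 3 columns: date, a variable var1, and ID. I am trying to compute the rolling mean of var1 over the previous 3 days, excluding the current day.
Here is a small excerpt of my data frame: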
date var1 ID
<date> <dbl> <int>
1 2010-01-04 -0.124 10371
2 2010-01-05 -0.162 10371
3 2011-11-25 NaN 13011
4 2016-11-10 NaN 16350
5 2016-11-11 -1.000 16350
6 2016-12-13 1.000 16350
7 2016-12-30 1.000 16517
8 2016-12-27 0.366 16524
structure(list(date = structure(c(14613, 14614, 15303, 17115,
17116, 17148, 17165, 17162), class = "Date"), var1 = c(-0.124,
-0.162, NaN, NaN, -1, 1, 1, 0.366), ID = c(10371L,
10371L, 13011L, 16350L, 16350L, 16350L, 16517L, 16524L)), .Names = c("date",
"var1", "ID"), row.names = c(NA, -8L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), vars = "ID", drop = TRUE, indices = list(
0:1, 2L, 3:5, 6L, 7L), group_sizes = c(2L, 1L, 3L, 1L, 1L
), biggest_group_size = 3L, labels = structure(list(ID = c(10371L,
13011L, 16350L, 16517L, 16524L)), row.names = c(NA, -5L), class = "data.frame",
vars = "ID", drop = TRUE, .Names = "ID"))
My code uses dplyr and rollapplyr, as follows:
library(dplyr)
library(zoo)
newdf = df %>% group_by(ID) %>% mutate(var1.lag1 = lag(var1, n = 1)) %>%
mutate(avgvar1.3d = rollapplyr(data = var1.lag1,width = 3,FUN = mean,
align = "right",na.rm = T))
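I expected to get NA in the cases where the rolling window size (3 here) is larger than the number of observations in the group. Instead, I am stuck on the following error: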
Error in mutate_impl(.data, dots) :
Evaluation error: wrong sign in 'by' argument.
Any help would be greatly appreciated.
Answer 0 (score: 1)
It looks like you need to add partial = TRUE to the rollapplyr call. With that change, the result looks like this:
newdf = df %>% group_by(ID) %>% mutate(var1.lag1 = lag(var1, n = 1)) %>%
mutate(avgvar1.3d = rollapplyr(data = var1.lag1,width = 3,FUN = mean, partial = TRUE,
align = "right",na.rm = T))
newdf
# A tibble: 8 x 5
# Groups: ID [5]
date var1 ID var1.lag1 avgvar1.3d
<date> <dbl> <int> <dbl> <dbl>
1 2010-01-04 - 0.124 10371 NA NaN
2 2010-01-05 - 0.162 10371 - 0.124 - 0.124
3 2011-11-25 NaN 13011 NA NaN
4 2016-11-10 NaN 16350 NA NaN
5 2016-11-11 - 1.00 16350 NaN NaN
6 2016-12-13 1.00 16350 - 1.00 - 1.00
7 2016-12-30 1.00 16517 NA NaN
8 2016-12-27 0.366 16524 NA NaN
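Note that with partial = TRUE the incomplete windows yield NaN or a mean over fewer than 3 values, whereas the question asked for NA there. The following is a minimal sketch of one way to get that, not the answerer's method: it keeps partial = TRUE so short groups no longer fail, and passes a helper that returns NA unless the window is full. The helper name full_window_mean and the small example data frame are made up for illustration. Like the original code, it counts 3 rows rather than 3 calendar days.

```r
library(dplyr)
library(zoo)

# What partial = TRUE changes on a plain vector:
rollapplyr(c(1, 2, 3, 4), width = 3, FUN = mean)                  # complete windows only: 2 3
rollapplyr(c(1, 2, 3, 4), width = 3, FUN = mean, partial = TRUE)  # one value per element: 1 1.5 2 3

# Hypothetical helper (not part of zoo): mean of the window,
# but NA whenever the window holds fewer than k values.
full_window_mean <- function(x, k = 3) {
  if (length(x) < k) NA_real_ else mean(x, na.rm = TRUE)
}

# Illustrative data only: one group shorter than the window, one longer.
df <- data.frame(
  date = as.Date(c("2010-01-04", "2010-01-05",
                   "2016-11-10", "2016-11-11", "2016-12-13", "2016-12-14")),
  var1 = c(-0.124, -0.162, NaN, -1, 1, 0.5),
  ID   = c(10371L, 10371L, 16350L, 16350L, 16350L, 16350L)
)

newdf <- df %>%
  group_by(ID) %>%
  mutate(
    var1.lag1  = dplyr::lag(var1, n = 1),          # shift by one row to exclude the current row
    avgvar1.3d = rollapplyr(var1.lag1, width = 3,
                            FUN = full_window_mean,
                            partial = TRUE)        # short windows reach the helper and become NA
  ) %>%
  ungroup()

newdf$avgvar1.3d
# NA NA NA NA -1 0
```

Group 10371 has only two rows, so both of its values are NA. In group 16350 the first two windows are incomplete (NA), the third averages the single non-missing value -1, and the fourth averages -1 and 1.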