从以下内容开始:
library(tidyverse)
library(lubridate)
df <- tibble(
date = seq.Date(ymd("2018-01-01"), by = "month", length.out = 6),
y = c(20, 10, 15, 35, 40, 50)
)
df
#> # A tibble: 6 x 2
#> date y
#> <date> <dbl>
#> 1 2018-01-01 20
#> 2 2018-02-01 10
#> 3 2018-03-01 15
#> 4 2018-04-01 35
#> 5 2018-05-01 40
#> 6 2018-06-01 50
我想创建一个新列z
,该列是递归滚动6周期的平均值。也就是说,对于2018-07-01
,这只是最近六条记录的平均值,但是对于2018-08-01
,我们在新的滚动计算中使用先前计算的(相关)滚动平均值。>
2018-07-01 = mean(c(20, 10, 15, 35, 40, 50)) = 28.3333 2018-08-01 = mean(c(10, 15, 35, 40, 50, 28.3333)) = 29.7222 2018-09-01 = mean(c(15, 35, 40, 50, 28.3333, 29.7222) = 33.0093 ...etc...
我已经使用tibbletime::rollify
和zoo::rollmeanr
进行了一些尝试,但都不允许我递归地引用上次计算的滚动平均值。
所需的输出:
desired_df <- tibble(
date = seq.Date(ymd("2018-01-01"), by = "month", length.out = 22),
y = c(20, 10, 15, 35, 40, 50, rep(NA, 16)),
z = c(
rep(NA, 6),
28.3333, 29.7222, 33.0093, 36.0108, 36.1793, 35.5425, 33.1329,
33.9328, 34.6346, 34.9055, 34.7213, 34.4783, 34.3009, 34.4955,
34.5893, 34.5818
)
)
desired_df
#> # A tibble: 22 x 3
#> date y z
#> <date> <dbl> <dbl>
#> 1 2018-01-01 20 NA
#> 2 2018-02-01 10 NA
#> 3 2018-03-01 15 NA
#> 4 2018-04-01 35 NA
#> 5 2018-05-01 40 NA
#> 6 2018-06-01 50 NA
#> 7 2018-07-01 NA 28.3
#> 8 2018-08-01 NA 29.7
#> 9 2018-09-01 NA 33.0
#> 10 2018-10-01 NA 36.0
#> # ... with 12 more rows
答案 0 :(得分:2)
我们可以创建一个使用简单的for
循环作为简单解决方案的函数。
recursive_roll <- function(x, fn = mean, window_size = 6, ...) {
# Use fn (mean by default) on a rolling recursive window
# ... are arguments passed to fn
n <- length(x)
result <- x
for ( i in (window_size + 1):n ) {
result[i] <- fn(result[(i - window_size):(i - 1)], ...)
}
# I add in this line below to make it in line with your desired output.
# You may choose to omit this (keep the initial values of your vector),
# or even make this part optional.
result[1:window_size] <- NA
return(result)
}
需要注意的一点是,算法最终会收敛为一个重复的数字。我使用50个观察值而不是22个观察值来证明这一点:
library(dplyr)
library(lubridate)
N <- 50 # Total number of observations; I use 50 to illustrate convergence
window_size <- 6
df <- tibble(
date = seq.Date(ymd("2018-01-01"), by = "month", length.out = N),
y = c(20, 10, 15, 35, 40, 50, rep(NA, N - window_size))
)
desired_df <- df %>% mutate(z = recursive_roll(y))
让我们检查一下结果
desired_df
# A tibble: 50 x 3
date y z
<date> <dbl> <dbl>
1 2018-01-01 20 NA
2 2018-02-01 10 NA
3 2018-03-01 15 NA
4 2018-04-01 35 NA
5 2018-05-01 40 NA
6 2018-06-01 50 NA
7 2018-07-01 NA 28.3
8 2018-08-01 NA 29.7
9 2018-09-01 NA 33.0
10 2018-10-01 NA 36.0
# … with 40 more rows
tail(desired_df)
# A tibble: 6 x 3
date y z
<date> <dbl> <dbl>
1 2021-09-01 NA 34.5
2 2021-10-01 NA 34.5
3 2021-11-01 NA 34.5
4 2021-12-01 NA 34.5
5 2022-01-01 NA 34.5
6 2022-02-01 NA 34.5
plot(desired_df$date, desired_df$z, type = "l")
更具体地说,您的算法收敛到的数字可以解析为
r <- sum(1:window_size * head(desired_df$y, window_size)) / sum(1:window_size)
使用N = 500
后,我们看到
desired_df$z[N] == r
# [1] TRUE
sprintf("%.17f", c(desired_df$z[N], r))
# [1] "34.52380952380952550" "34.52380952380952550"
这是因为您仅使用window_size
个观测值;您可能会更喜欢的是指数加权移动平均线:
ewma <- function(x, weight = 1 / (length(x) + 1)) {
# Gives the exponentially weighted moving average, defined as:
# EWMA_t = weight * x_t + (1 - weight) * EWMA_{t-1}
result <- x
for ( i in 2:length(x) ) {
result[i] <- weight * result[i] + (1 - weight) * result[i - 1]
}
return(result)
}
set.seed(123)
N <- 50
x <- c(20, 10, 15, 35, 40, 50)
df <- tibble(
date = seq.Date(ymd("2018-01-01"), by = "month", length.out = N),
y = c(x, sample(30:50, size = N - window_size, replace = TRUE))
)
df2 <- df %>% mutate(z = recursive_roll(y), z2 = ewma(y))
plot(df2$date, df2$y, pch = 20, col = "#80808080")
lines(df2$date, df2$z, col = "blue")
lines(df2$date, df2$z2, col = "red")