Recursive rolling mean in R

Date: 2019-05-06 18:59:16

Tags: r dplyr rolling-computation

Starting with the following:

library(tidyverse)
library(lubridate)

df <- tibble(
  date = seq.Date(ymd("2018-01-01"), by = "month", length.out = 6),
  y = c(20, 10, 15, 35, 40, 50)
)

df 
#> # A tibble: 6 x 2
#>   date           y
#>   <date>     <dbl>
#> 1 2018-01-01    20
#> 2 2018-02-01    10
#> 3 2018-03-01    15
#> 4 2018-04-01    35
#> 5 2018-05-01    40
#> 6 2018-06-01    50

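I would like to create a new column z that is a recursive rolling 6-period mean. That is, for 2018-07-01 it is simply the mean of the last six records, but for 2018-08-01 the new rolling calculation also uses the previously computed rolling mean wherever it falls inside the window.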

2018-07-01 = mean(c(20, 10, 15, 35, 40, 50)) = 28.3333
2018-08-01 = mean(c(10, 15, 35, 40, 50, 28.3333)) = 29.7222
2018-09-01 = mean(c(15, 35, 40, 50, 28.3333, 29.7222)) = 33.0093
...etc...

I have made some attempts with tibbletime::rollify and zoo::rollmeanr, but neither lets me refer recursively to the previously computed rolling mean.

Desired output:

desired_df <- tibble(
  date = seq.Date(ymd("2018-01-01"), by = "month", length.out = 22),
  y = c(20, 10, 15, 35, 40, 50, rep(NA, 16)),
  z = c(
    rep(NA, 6), 
    28.3333, 29.7222, 33.0093, 36.0108, 36.1793, 35.5425, 33.1329, 
    33.9328, 34.6346, 34.9055, 34.7213, 34.4783, 34.3009, 34.4955, 
    34.5893, 34.5818
  )
)
desired_df
#> # A tibble: 22 x 3
#>    date           y     z
#>    <date>     <dbl> <dbl>
#>  1 2018-01-01    20  NA  
#>  2 2018-02-01    10  NA  
#>  3 2018-03-01    15  NA  
#>  4 2018-04-01    35  NA  
#>  5 2018-05-01    40  NA  
#>  6 2018-06-01    50  NA  
#>  7 2018-07-01    NA  28.3
#>  8 2018-08-01    NA  29.7
#>  9 2018-09-01    NA  33.0
#> 10 2018-10-01    NA  36.0
#> # ... with 12 more rows

1 answer:

Answer 0 (score: 2)

A straightforward solution is to write a function built around a simple for loop.

recursive_roll <- function(x, fn = mean, window_size = 6, ...) {
    # Use fn (mean by default) on a rolling recursive window
    # ... are arguments passed to fn
    n <- length(x)
    # Guard: (window_size + 1):n would count backwards for short inputs
    stopifnot(n > window_size)
    result <- x
    for ( i in (window_size + 1):n ) {
        result[i] <- fn(result[(i - window_size):(i - 1)], ...)
    }
    # I add in this line below to make it in line with your desired output.
    # You may choose to omit this (keep the initial values of your vector),
    # or even make this part optional.
    result[1:window_size] <- NA
    return(result)
}
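
For the default case fn = mean, the same recursion can also be written without an explicit loop. The sketch below is an alternative and not part of the original answer: it relies on base R's stats::filter with method = "recursive", and the helper name recursive_mean_extend and its horizon argument h are made up for illustration. It takes only the observed values and returns the h recursive means that follow them.

```r
# Hypothetical helper (not from the original answer): return the h recursive
# rolling means that follow x, seeded by its last window_size values.
recursive_mean_extend <- function(x, h, window_size = 6) {
    stopifnot(length(x) >= window_size, h >= 1)
    seed <- tail(x, window_size)
    # A recursive filter computes out[i] = in[i] + sum(coef * previous outputs).
    # With zero input and equal coefficients 1 / window_size, each output is
    # the mean of the window_size values immediately before it.
    out <- stats::filter(
        rep(0, h),
        filter = rep(1 / window_size, window_size),
        method = "recursive",
        init = rev(seed)  # init is supplied in reverse time order
    )
    as.numeric(out)
}

z_new <- recursive_mean_extend(c(20, 10, 15, 35, 40, 50), h = 6)
round(z_new, 4)
```

With the question's six values, the result should match rows 7 to 12 of the desired z column. Calling stats::filter with its namespace matters once dplyr is attached, because dplyr::filter masks it.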

One thing to note is that the algorithm eventually converges to a constant value. I use 50 observations instead of 22 to demonstrate this:

library(dplyr)
library(lubridate)

N <- 50 # Total number of observations; I use 50 to illustrate convergence
window_size <- 6

df <- tibble(
    date = seq.Date(ymd("2018-01-01"), by = "month", length.out = N),
    y = c(20, 10, 15, 35, 40, 50, rep(NA, N - window_size))
)

desired_df <- df %>% mutate(z = recursive_roll(y))

Let's inspect the results:

desired_df
# A tibble: 50 x 3
   date           y     z
   <date>     <dbl> <dbl>
 1 2018-01-01    20  NA  
 2 2018-02-01    10  NA  
 3 2018-03-01    15  NA  
 4 2018-04-01    35  NA  
 5 2018-05-01    40  NA  
 6 2018-06-01    50  NA  
 7 2018-07-01    NA  28.3
 8 2018-08-01    NA  29.7
 9 2018-09-01    NA  33.0
10 2018-10-01    NA  36.0
# … with 40 more rows
tail(desired_df)
# A tibble: 6 x 3
  date           y     z
  <date>     <dbl> <dbl>
1 2021-09-01    NA  34.5
2 2021-10-01    NA  34.5
3 2021-11-01    NA  34.5
4 2021-12-01    NA  34.5
5 2022-01-01    NA  34.5
6 2022-02-01    NA  34.5

plot(desired_df$date, desired_df$z, type = "l")


More specifically, the number your algorithm converges to can be computed analytically as

r <- sum(1:window_size * head(desired_df$y, window_size)) / sum(1:window_size)
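
One way to see where this weighted mean comes from is sketched below. It is not part of the original answer: the symbols k, x_n and S_n are introduced here, with k = window_size and x_1, ..., x_k the observed values, and the last step assumes that the sequence converges, as the numerical output suggests. The idea is that a linearly weighted sum over the current window is left unchanged by each step of the recursion.

```latex
\begin{align*}
x_{n+1} &= \frac{1}{k}\sum_{j=1}^{k} x_{n-k+j}, \qquad
S_n = \sum_{j=1}^{k} j\, x_{n-k+j} \\
S_{n+1} &= \sum_{j=1}^{k} j\, x_{n+1-k+j}
         = \sum_{j=2}^{k} (j-1)\, x_{n-k+j} + k\, x_{n+1} \\
        &= \sum_{j=2}^{k} (j-1)\, x_{n-k+j} + \sum_{j=1}^{k} x_{n-k+j}
         = S_n \\
x_n \to r \;&\Longrightarrow\; S_n \to r \sum_{j=1}^{k} j
\;\Longrightarrow\;
r = \frac{\sum_{j=1}^{k} j\, x_j}{\sum_{j=1}^{k} j}
\end{align*}
```

For the six values in the question this gives 725 / 21, roughly 34.5238, which agrees with the tail of the output above.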

After rerunning with N = 500, we see:

desired_df$z[N] == r
# [1] TRUE
sprintf("%.17f", c(desired_df$z[N], r))
# [1] "34.52380952380952550" "34.52380952380952550"
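
Exact equality with == holds in the run above, but comparing doubles bit for bit is fragile in general, so a tolerance-based check such as all.equal is usually the safer habit. The following sketch is self-contained and rebuilds the recursion in base R; the variable names are chosen for illustration only.

```r
# Rebuild the recursion in base R so the check runs on its own.
window_size <- 6
N <- 500
y <- c(20, 10, 15, 35, 40, 50)

z <- c(y, rep(NA_real_, N - window_size))
for (i in (window_size + 1):N) {
    z[i] <- mean(z[(i - window_size):(i - 1)])
}

# Closed-form limit: linearly weighted mean of the seed values.
r <- sum(seq_len(window_size) * y) / sum(seq_len(window_size))

# all.equal() compares within a small tolerance instead of bit for bit.
converged <- isTRUE(all.equal(z[N], r))
converged
```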

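This happens because only window_size observations ever enter the calculation; every later value is an average of earlier averages. What you may prefer instead is an exponentially weighted moving average: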

ewma <- function(x, weight = 1 / (length(x) + 1)) {
    # Gives the exponentially weighted moving average, defined as:
    # EWMA_t = weight * x_t + (1 - weight) * EWMA_{t-1}
    result <- x
    for ( i in seq_along(x)[-1] ) {  # safe when length(x) < 2, unlike 2:length(x)
        result[i] <- weight * result[i] + (1 - weight) * result[i - 1]
    }
    return(result)
}
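
As a quick sanity check of the recursion in the comment above, the same EWMA can be computed with base R's recursive filter. This is a sketch under assumptions, not the answer's method: ewma_filter is a hypothetical name, and the first input is left unscaled so that the series starts at x[1], matching the loop version.

```r
# Hypothetical loop-free variant of the EWMA recursion:
# EWMA_1 = x_1, EWMA_t = weight * x_t + (1 - weight) * EWMA_{t-1}
ewma_filter <- function(x, weight = 1 / (length(x) + 1)) {
    if (length(x) < 2) return(x)
    # Scale every input except the first, which seeds the recursion as-is.
    input <- c(x[1], weight * x[-1])
    as.numeric(stats::filter(input, 1 - weight, method = "recursive"))
}

smoothed <- ewma_filter(c(20, 10, 15, 35, 40, 50), weight = 0.5)
smoothed
#> [1] 20.00 15.00 15.00 25.00 32.50 41.25
```

With weight = 0.5 each value is the midpoint between the new observation and the previous smoothed value, which makes the output easy to verify by hand.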

set.seed(123)
N <- 50
x <- c(20, 10, 15, 35, 40, 50)
df <- tibble(
    date = seq.Date(ymd("2018-01-01"), by = "month", length.out = N),
    y = c(x, sample(30:50, size = N - window_size, replace = TRUE))
)

df2 <- df %>% mutate(z = recursive_roll(y), z2 = ewma(y))
plot(df2$date, df2$y, pch = 20, col = "#80808080")
lines(df2$date, df2$z, col = "blue")
lines(df2$date, df2$z2, col = "red")
