Question

I have looked at several tools for creating moving or rolling averages (rollapply(); filter(); runmean(); group_by() and summarise()), all of which assume a continuous data sequence.

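However, I want to apply a function to a recurring moving window in periodic data, so the window needs to wrap around the end of the data series.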

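Specifically, for example, I want the 30-day mean of daily data for each day of the year from a 20-year data series, i.e. producing 366 values. The interesting point, and the place where rollapply() and the like do not seem to apply, is that, again for example, the value for day 10 (covering the 29 days leading up to 10 January of each year) includes the end of the last year of the 20-year data series.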

A simple solution for obtaining the multi-year daily mean and standard deviation, pooled across years, was provided elsewhere (R daily multiyear mean), using the following:

require(tibble)
require(lubridate)
require(dplyr)
df <- tibble(date=seq.Date(from=as.Date("1968-10-01"), to=as.Date("1973-09-30"), by="days"),
          value=runif(length(date)))

df %>%
  group_by(dayIdx = yday(date)) %>%
  summarise(mean_val = mean(value, na.rm=TRUE), sd_val = sd(value, na.rm=TRUE))

and group_by() can be extended to produce results for each data series, but I cannot see how to apply it to a multi-day window.

In the example below there are:

12 values per period (i.e. year), indexed 0:11,
5 periods, and
one data series of 60 records (1:60).

For an averaging window of 3 records:

For the second record of the result (index 1), the mean() and sd() functions are applied to records:

1,2,12,13,14,24,25,26,36,37,38,48,49,50,60

which have index values 0, 1 and 11.
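
The record list above can be reproduced with modular arithmetic. The following is a minimal base R sketch, assuming the layout described in the example (5 periods of 12 records, numbered 1:60); the names window_size, target_idx and win_idx are illustrative only.

```r
# Assumed layout, matching the example: 5 periods of 12 records, numbered 1:60
len_period  <- 12
num_periods <- 5
rec_num <- seq_len(len_period * num_periods)
idx     <- (rec_num - 1) %% len_period   # position within the period, 0:11

window_size <- 3
target_idx  <- 1                         # second record of the result

# Indices covered by the trailing window; %% wraps negative values
# round to the end of the period
win_idx <- (target_idx + ((-window_size + 1):0)) %% len_period
sort(win_idx)

# Records from every period whose index falls inside the window
rec_num[idx %in% win_idx]
```

Because %% in R returns a non-negative result for a positive divisor, index -1 maps to 11, which is what pulls record 60 into the window.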

As I understand rollapply(), record 60 would not be included.
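
For the mean alone, one possible workaround is the circular option of stats::filter(), which wraps a convolution filter around the ends of the series. The sketch below assumes a single series with complete periods and no missing values, in which case averaging the circular rolling means by index equals the pooled window mean. It does not give the pooled sd(), and the variable names are illustrative.

```r
set.seed(42)
len_period  <- 12
num_periods <- 5
data_val <- runif(len_period * num_periods)
idx <- rep(0:(len_period - 1), times = num_periods)
window_size <- 3

# Trailing 3-record mean over the whole series. circular = TRUE wraps the
# start of the series round to its end, so record 2 uses records 60, 1 and 2.
# stats:: is spelled out because dplyr masks filter().
roll_mean <- stats::filter(data_val, rep(1 / window_size, window_size),
                           sides = 1, circular = TRUE)

# Average the rolling means by position within the period. With equal-sized,
# NA-free windows this mean of means equals the mean of the pooled records.
val_mean <- tapply(as.numeric(roll_mean), idx, mean)
val_mean
```

With missing values or incomplete periods the windows no longer carry equal weight, so the shortcut would not match the loop-based result.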

To make it more interesting, I am working with multiple data series of different lengths (numbers of years) and with multiple averaging windows.

I can produce the function results with for loops, but I am sure there is some obscure approach that eludes me.

require(dplyr)
require(tibble)
##create data
num_series <- 3 #the number of dataseries
len_period <- 12 #number of records per 'year'
num_periods <- 5 #number of 'years'
df <- tibble(
  series_id = rep(1:num_series, each = num_periods * len_period),
  rec_num = rep(1:(len_period * num_periods), times = num_series),
  idx = rep(0:(len_period - 1), times = num_series * num_periods),
  data_val = runif(length(idx))
)

## vector of averaging windows
av_window <- c(3, 6, 9)

##prepare empty tibble
result <- tibble(
  series_id = rep(1:num_series, each = length(av_window) * len_period),
  idx = rep(0:(len_period - 1), times = length(av_window) * num_series),
  av_win = rep(rep(av_window, each = len_period), times = num_series),
  win_start = (idx - av_win + 1) %% len_period,
  val_mean = 0.0,
  val_sd = 0.0
)

## loop through data series
data_series <- distinct(result, series_id)
for (id in data_series$series_id){
  ##loop through averaging windows
  for (av_win_val in av_window){
    ##loop through record indices
    for (idx_val in (0:(len_period-1))){
      dfTemp <- subset(df, (series_id == id) & (idx %in% ((idx_val + ((-av_win_val + 1):0)) %% len_period)))
      logical_vec <- result$series_id == id & result$av_win == av_win_val & result$idx == idx_val
      result[logical_vec, "val_mean"] <- mean(dfTemp$data_val, na.rm =TRUE)
      result[logical_vec, "val_sd"] <- sd(dfTemp$data_val, na.rm =TRUE)
    }
  }    
}

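Conceptually, one solution might be to reshape the data into 'day' rows and 'year' columns and then apply the function to all the values in a moving 2D block, but this still does not address the wrap-around part of the problem. One could, however, imagine extending this concept to a third dimension for multiple data series.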
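
The 2D-block idea can be combined with wrapped row indices. The following is a rough base R sketch for a single series, assuming complete periods of equal length stored in record order; wrapped_block_stats() is a hypothetical helper written for this illustration, not a function from any package.

```r
set.seed(1)
len_period  <- 12
num_periods <- 5
data_val <- runif(len_period * num_periods)

# Rows = index within the period, columns = periods (matrix() fills by column)
m <- matrix(data_val, nrow = len_period, ncol = num_periods)

# Hypothetical helper: mean and sd over a trailing block of rows,
# with the row numbers wrapped round the end of the period
wrapped_block_stats <- function(m, window_size) {
  n <- nrow(m)
  t(sapply(seq_len(n), function(i) {
    # 0-based indices (i - window_size):(i - 1), wrapped, then made 1-based
    rows  <- ((i - window_size):(i - 1)) %% n + 1
    block <- as.vector(m[rows, , drop = FALSE])
    c(idx = i - 1,
      val_mean = mean(block, na.rm = TRUE),
      val_sd   = sd(block, na.rm = TRUE))
  }))
}

res <- wrapped_block_stats(m, window_size = 3)
head(res)
```

This still iterates over the rows internally via sapply(), so it mainly tidies the bookkeeping rather than removing the loop. Series of different lengths would each need their own matrix.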

Any suggestions would be welcome - JS

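EDIT: From Hadley Wickham's split-apply-combine (plyr) paper:

Note that plyr makes the strong assumption that each piece of data will be processed only once and independently of all other pieces. This means that you cannot use these tools when each iteration requires overlapping data (such as a running mean), or when it depends on the previous iteration (such as in a dynamic simulation). Loops are still most appropriate for these tasks.

EDIT2: This summary of timeseries manipulation and analysis tools may be useful, but it is pretty ... dense.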

Moving function applied to periodic data with non-contiguous windows

0 answers: