Moving average over each past 5 minutes of data

Time: 2016-11-07 07:42:28

Tags: r group-by aggregate

Here is my data.

For each tick, I want to create another column, 5min_Ret, whose value should be the average of return over the last 5 minutes. Below is the desired output, with the calculation logic noted at the end of each row. The Logic column is here only for explanation; it will not be added to the final output.

times               value  size  return       5min_Ret     Logic
2016-06-01 9:07:11  14.2   595   0            0            First Tick 0
2016-06-01 9:08:11  14.2   2505  0.003527341  0.001763671  Avg of 1 to 2
2016-06-01 9:11:03  14.15  1     0            0.00117578   Avg of 1 to 3
2016-06-01 9:13:03  14.15  2200  0.003527341  0.002351561  Avg of 2 to 4
2016-06-01 9:15:04  14.2   480   0            0.00117578   Avg of 3 to 5
2016-06-01 9:15:04  14.2   2965  0.003527341  0.001763671  Avg of 3 to 6
2016-06-01 9:15:05  14.2   144   0            0.001410936  Avg of 3 to 7
2016-06-01 9:20:05  14.2   1856  0.003514942  0.001757471  Avg of 7 to 8
2016-06-01 9:22:06  14.25  300   0            0.001757471  Avg of 8 to 9
2016-06-01 9:25:06  14.25  856   0.003514942  0.001757471  Avg of 9 to 10

I think the dplyr package is very useful for group by. But for each tick, I have not managed to group the data by a 5-minute interval. Any suggestions or help in R would be appreciated.

Thanks.

2 answers:

Answer 0: (score: 2)

You can do this with sapply. Let's assume your object is called df.

# Assumes df$times is POSIXct, so subtracting 5*60 moves back 5 minutes (in seconds)
df$'5min_ret' <- sapply( X = seq_along( df$return ), 
                        FUN = function(x) { 
                            mean( df$return[ df$times >= df$times[x] - 5*60 & 
                                                   df$times <= df$times[x] ] ) 
                        } )

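Note that the seq_along call simply creates an integer sequence as long as the number of rows in the data frame (10 in your case).

The function defined after FUN is the important part. It takes the subset of the data frame whose times fall within the last 5 minutes (at or after 5 minutes ago, and at or before the current tick), and takes the mean of the return column for the rows that remain. sapply simply runs that function for each value of X (our 1:10 sequence).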
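To make the window logic concrete, here is a small sketch that evaluates the same condition for a single tick. It uses only the first four rows of the sample data; the names times, ret, in_window, rows_used and window_mean are illustrative, and tz = "UTC" is an assumption made so the result does not depend on the local time zone.

```r
# First four ticks from the question's sample data
times <- as.POSIXct(c("2016-06-01 09:07:11", "2016-06-01 09:08:11",
                      "2016-06-01 09:11:03", "2016-06-01 09:13:03"),
                    tz = "UTC")
ret <- c(0, 0.003527341, 0, 0.003527341)

x <- 4  # look at the fourth tick (09:13:03)

# TRUE for ticks between 09:08:03 and 09:13:03, both ends inclusive
in_window <- times >= times[x] - 5 * 60 & times <= times[x]

rows_used   <- which(in_window)     # rows that fall inside the window
window_mean <- mean(ret[in_window]) # average return over those rows
```

For this tick the window picks up rows 2 to 4, which corresponds to the "Avg of 2 to 4" entry in the question's Logic column.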

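Note, though, that naming a column 5min_ret is generally not a good idea, because R does not like names of that form: a name that starts with a digit is not syntactically valid. I wrapped it in quotes when creating it to get around this, but I would suggest considering a different name.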
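As a rough illustration of the naming issue, the sketch below shows what base R does with such a name. The data frames and the alternative name ret_5min are made up for the example.

```r
# make.names() shows how R would repair a non-syntactic name
fixed <- make.names("5min_ret")

# data.frame() applies the same repair by default (check.names = TRUE)
d1 <- data.frame(`5min_ret` = c(0.1, 0.2))
n1 <- names(d1)

# With check.names = FALSE the name is kept, but it then needs
# backticks (or quotes after $) every time it is used
d2 <- data.frame(`5min_ret` = c(0.1, 0.2), check.names = FALSE)
v2 <- d2$`5min_ret`

# A syntactic name such as ret_5min avoids the quoting altogether
d3 <- data.frame(ret_5min = c(0.1, 0.2))
v3 <- d3$ret_5min
```

Choosing a name that starts with a letter means the column can be used directly in formulas and in dplyr verbs without any quoting.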

Answer 1: (score: 1)

df = data.frame(times = c("2016-06-01 9:07:11", "2016-06-01 9:08:11", "2016-06-01 9:11:03", "2016-06-01 9:13:03","2016-06-01 9:15:04","2016-06-01 9:15:04", "2016-06-01 9:15:05",
                           "2016-06-01 9:20:05", "2016-06-01 9:22:06", "2016-06-01 9:25:06"),
                 return = c( 0, 0.003527341, 0, 0.003527341, 0, 0.003527341, 0, 0.003514942, 0, 0.003514942))
df$times = as.POSIXct(df$times)
df
             times      return
1  2016-06-01 09:07:11 0.000000000
2  2016-06-01 09:08:11 0.003527341
3  2016-06-01 09:11:03 0.000000000
4  2016-06-01 09:13:03 0.003527341
5  2016-06-01 09:15:04 0.000000000
6  2016-06-01 09:15:04 0.003527341 
7  2016-06-01 09:15:05 0.000000000
8  2016-06-01 09:20:05 0.003514942
9  2016-06-01 09:22:06 0.000000000
10 2016-06-01 09:25:06 0.003514942

# another dataframe for the start/end timeframe
df1  = data.frame("start" = df$times - 5*60, "end" = as.POSIXct(df$times))
df1
          start                 end 
1  2016-06-01 09:02:11 2016-06-01 09:07:11
2  2016-06-01 09:03:11 2016-06-01 09:08:11
3  2016-06-01 09:06:03 2016-06-01 09:11:03
4  2016-06-01 09:08:03 2016-06-01 09:13:03
5  2016-06-01 09:10:04 2016-06-01 09:15:04
6  2016-06-01 09:10:04 2016-06-01 09:15:04
7  2016-06-01 09:10:05 2016-06-01 09:15:05
8  2016-06-01 09:15:05 2016-06-01 09:20:05
9  2016-06-01 09:17:06 2016-06-01 09:22:06
10 2016-06-01 09:20:06 2016-06-01 09:25:06

library(dplyr)
df.mean <- df1 %>% 
   group_by(start, end) %>% 
   summarize(ret.mean = mean(df$return[df$times >= start & df$times <= end]))
df.mean
Source: local data frame [9 x 3]
Groups: start [?]

            start                 end    ret.mean
           (time)              (time)       (dbl)
1 2016-06-01 09:02:11 2016-06-01 09:07:11 0.000000000
2 2016-06-01 09:03:11 2016-06-01 09:08:11 0.001763670
3 2016-06-01 09:06:03 2016-06-01 09:11:03 0.001175780
4 2016-06-01 09:08:03 2016-06-01 09:13:03 0.002351561
5 2016-06-01 09:10:04 2016-06-01 09:15:04 0.001763670
6 2016-06-01 09:10:05 2016-06-01 09:15:05 0.001410936
7 2016-06-01 09:15:05 2016-06-01 09:20:05 0.001757471
8 2016-06-01 09:17:06 2016-06-01 09:22:06 0.001757471
9 2016-06-01 09:20:06 2016-06-01 09:25:06 0.001757471

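You will notice that groups 5 and 6 have been merged, because they have the same boundaries. I have gone through the procedure step by step so that you can follow the approach. You can put it all together in one data frame later.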
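One possible way to put everything in a single data frame while keeping one row per tick is sketched below. It is not the answerer's code: it skips the group_by step and uses base R's vapply, so rows with identical window boundaries are not collapsed. The tz = "UTC" argument and the zero-padded timestamps are assumptions made for reproducibility.

```r
df <- data.frame(
  times = as.POSIXct(c("2016-06-01 09:07:11", "2016-06-01 09:08:11",
                       "2016-06-01 09:11:03", "2016-06-01 09:13:03",
                       "2016-06-01 09:15:04", "2016-06-01 09:15:04",
                       "2016-06-01 09:15:05", "2016-06-01 09:20:05",
                       "2016-06-01 09:22:06", "2016-06-01 09:25:06"),
                     tz = "UTC"),
  return = c(0, 0.003527341, 0, 0.003527341, 0,
             0.003527341, 0, 0.003514942, 0, 0.003514942)
)

# Start of each 5-minute window; the window ends at the tick itself
df$start <- df$times - 5 * 60

# One mean per row, so ticks sharing a timestamp keep separate rows
df$ret.mean <- vapply(seq_len(nrow(df)), function(i) {
  in_window <- df$times >= df$start[i] & df$times <= df$times[i]
  mean(df$return[in_window])
}, numeric(1))
```

Because the window is defined purely by time, the two ticks stamped 09:15:04 get the same value here, whereas the question's Logic column treats them separately ("Avg of 3 to 5" and "Avg of 3 to 6"). Reproducing that distinction would require a rule based on row order as well as time.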