Calculating the mean of every two consecutive values in a data frame

Time: 2015-06-24 09:37:47

Tags: r dataframe aggregate mean

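I want to aggregate every two cell values by their mean and continue the same process down the column of the data frame. More precisely, see the following extract of the data frame: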

    X         Y             Z
1   FRI 200101010000    -6.72
2   FRI 200101010030    -6.30
3   FRI 200101010100    -6.26
4   FRI 200101010130    -5.82
5   FRI 200101010200    -5.64
6   FRI 200101010230    -5.29
7   FRI 200101010300    -5.82
8   FRI 200101010330    -5.83
9   FRI 200101010400    -5.83
10  FRI 200101010430    -6.04
11  FRI 200101010500    -5.80
12  FRI 200101010530    -6.09

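I want to compute the mean of Z for each pair of Y values ending in 00 and 30, which means computing the mean of rows 1 and 2, rows 3 and 4, rows 5 and 6, and so on. What I expect is: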

    X         Y             Z
1   FRI 200101010100    -6.51
2   FRI 200101010200    -6.04
3   FRI 200101010300    -5.47
...

Explanation: Y is a timestamp in the format YYYYMMDDhhmm, and I want to average the 30-minute measurements into 1-hour values.

4 Answers:

Answer 0: (score: 4)

Here is a possible data.table solution

library(data.table)
setDT(df)[, .(Y = Y[1L], Z = mean(Z)), by = .(X, indx = cumsum(substr(Y, 11, 12) == '00'))]
#      X indx            Y      Z
# 1: FRI    1 200101010000 -6.510
# 2: FRI    2 200101010100 -6.040
# 3: FRI    3 200101010200 -5.465
# 4: FRI    4 200101010300 -5.825
# 5: FRI    5 200101010400 -5.935
# 6: FRI    6 200101010500 -5.945

Or, per @akrun's comment, using aggregate from base R (though the output may need some additional tweaking)

aggregate(Z ~ X + indx, transform(df, indx = cumsum(substr(Y, 11, 12) == '00')), mean)
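
One way the extra tweaking might look, as a minimal base R sketch. It rebuilds the sample data from the question (storing Y as character is an assumption), attaches the first Y of each group, and drops the helper column indx:

```r
# Sample data rebuilt from the question (Y as character is an assumption).
df <- data.frame(
  X = "FRI",
  Y = sprintf("20010101%02d%02d", rep(0:5, each = 2), rep(c(0L, 30L), times = 6)),
  Z = c(-6.72, -6.30, -6.26, -5.82, -5.64, -5.29,
        -5.82, -5.83, -5.83, -6.04, -5.80, -6.09),
  stringsAsFactors = FALSE
)

# A new group starts at every row whose minutes are "00".
df$indx <- cumsum(substr(df$Y, 11, 12) == "00")

# Mean of Z per X and group.
res <- aggregate(Z ~ X + indx, data = df, FUN = mean)

# Attach the first Y of each group; match() returns the first matching row.
key_res <- paste(res$X, res$indx)
key_df  <- paste(df$X, df$indx)
res$Y <- df$Y[match(key_res, key_df)]

# Order by group and drop the helper column.
res <- res[order(res$X, res$indx), c("X", "Y", "Z")]
print(res)
```

The explicit order() call is there so the row order does not depend on how aggregate happens to sort its groups.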

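Answer 1: (score: 2)

A base R solution: I first split the vector into chunks and compute the mean of each chunk, which of course assumes that the row order you specified always holds. Finally I combine the pieces to give your result: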

Z <- unlist(lapply(split(df$Z, ceiling(seq_along(df$Z) / 2)), mean))
new_df <- cbind(df[seq(1,nrow(df), 2), c("X", "Y")], Z)

Output:

     X            Y      Z
1  FRI 200101010000 -6.510
3  FRI 200101010100 -6.040
5  FRI 200101010200 -5.465
7  FRI 200101010300 -5.825
9  FRI 200101010400 -5.935
11 FRI 200101010500 -5.945
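
The approach above relies on the rows arriving in strict 00/30 pairs. A compact variant of the same idea, sketched under that assumption, reshapes Z into a two-row matrix and takes column means, with a guard that stops if a pair spans two different hours. The sample data is rebuilt from the question, and storing Y as character is an assumption:

```r
# Sample data rebuilt from the question (Y as character is an assumption).
df <- data.frame(
  X = "FRI",
  Y = sprintf("20010101%02d%02d", rep(0:5, each = 2), rep(c(0L, 30L), times = 6)),
  Z = c(-6.72, -6.30, -6.26, -5.82, -5.64, -5.29,
        -5.82, -5.83, -5.83, -6.04, -5.80, -6.09),
  stringsAsFactors = FALSE
)

# Guard: an even number of rows, and both rows of every pair share the same hour.
odd <- seq(1, nrow(df), by = 2)
stopifnot(nrow(df) %% 2 == 0,
          substr(df$Y[odd], 1, 10) == substr(df$Y[odd + 1], 1, 10))

# Each matrix column holds one pair, so colMeans() gives the pairwise means.
new_df <- data.frame(df[odd, c("X", "Y")], Z = colMeans(matrix(df$Z, nrow = 2)))
print(new_df)
```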

Answer 2: (score: 1)

dplyr version

library(dplyr)

df$Y <- as.character(df$Y)

means <- df %>%
        group_by(hour = substr(Y, start = 1, stop=10)) %>% summarise(Z = mean(Z))

> means
Source: local data frame [6 x 2]

        hour      Z
1 2001010100 -6.510
2 2001010101 -6.040
3 2001010102 -5.465
4 2001010103 -5.825
5 2001010104 -5.935
6 2001010105 -5.945

This groups the data by the Y variable with its last two digits (the minutes) excluded.
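
Since Y encodes a timestamp as YYYYMMDDhhmm, the same hourly grouping can also be sketched in base R by parsing Y into POSIXct and formatting it down to the hour. The sample data is rebuilt from the question; the UTC time zone and the column names time and hour are assumptions made for illustration:

```r
# Sample data rebuilt from the question (Y as character is an assumption).
df <- data.frame(
  X = "FRI",
  Y = sprintf("20010101%02d%02d", rep(0:5, each = 2), rep(c(0L, 30L), times = 6)),
  Z = c(-6.72, -6.30, -6.26, -5.82, -5.64, -5.29,
        -5.82, -5.83, -5.83, -6.04, -5.80, -6.09),
  stringsAsFactors = FALSE
)

# Parse YYYYMMDDhhmm; UTC is assumed because the question gives no time zone.
df$time <- as.POSIXct(df$Y, format = "%Y%m%d%H%M", tz = "UTC")

# Label every row with the hour it falls in.
df$hour <- format(df$time, "%Y-%m-%d %H:00")

# Mean of Z per X and hour.
means <- aggregate(Z ~ X + hour, data = df, FUN = mean)
print(means)
```

Working with a real date-time class rather than substrings makes it easier to switch to other bin widths later.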

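Answer 3: (score: 0)

Although this does not solve the OP's problem, if you have a POSIXct column, you would typically aggregate by second like this: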

library(lubridate)
library(tidyverse)
s <- seq(from=Sys.time(), length.out = 100, by=0.4)
df  = data.frame(time=s,v=rnorm(length(s)))
df %>% 
  group_by(time=floor_date(time, "1 second")) %>% 
  summarize(v=mean(v)) # any other interval works too, e.g. "5 minutes"
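
For comparison, a base R sketch of the same per-second aggregation without lubridate. It floors the numeric epoch seconds and uses fixed, made-up values instead of rnorm() so that the result is reproducible; the start time and the column names sec and v are assumptions:

```r
# Ten timestamps 0.4 s apart from an arbitrary start (assumed for illustration).
start <- as.POSIXct("2015-06-24 09:37:47", tz = "UTC")
df <- data.frame(time = start + seq(0, by = 0.4, length.out = 10),
                 v = 1:10)

# Floor each timestamp to the whole second via its numeric epoch value.
df$sec <- floor(as.numeric(df$time))

# Mean of v per whole second.
res <- aggregate(v ~ sec, data = df, FUN = mean)

# Convert the epoch seconds back to POSIXct for display.
res$time <- as.POSIXct(res$sec, origin = "1970-01-01", tz = "UTC")
print(res[, c("time", "v")])
```

Other bin widths follow the same pattern, for example floor(as.numeric(df$time) / 300) * 300 for 5-minute bins.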