R: time-weighted mean when bins and time records overlap

Date: 2016-02-17 11:21:10

Tags: r

I have a set of pre-specified breaks,

breaks = 2:7

which form a set of bins: (2,3] (3,4] (4,5] (5,6] (6,7]. I also have a dataset that looks like this:

set.seed(42)
data = cbind.data.frame(time = cumsum(abs(rnorm(10))), value = rnorm(10))
> data
       time      value
1  1.370958  1.3048697
2  1.935657  2.2866454
3  2.298785 -1.3888607
4  2.931648 -0.2787888
5  3.335916 -0.1333213
6  3.442040  0.6359504
7  4.953562 -0.2842529
8  5.048222 -2.6564554
9  7.066645 -2.4404669
10 7.129359  1.3201133

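Treat time as the moment at which value is updated, so that the value is piecewise constant. What is a smart way to compute the weighted mean of value for each of the bins above? The result I want looks like this: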

     bin      mean
1  (2,3] -0.546621
2  (3,4]       ...

I computed the time-weighted mean as

(data$time[3]-2) * data$value[3] + 
  (data$time[4]-data$time[3])*data$value[4] + 
  (3-data$time[5]) * data$value[5]

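Note that the difficulty lies in computing the weighted mean from the bin borders. Otherwise I would simply use weighted.mean with diff(data$time) as the weights. The only workable strategy I have come up with is to add rows to data at the break times, each carrying the previous value forward, i.e.: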

> data.mod
       time      value
1  1.370958  1.3048697
2  1.935657  2.2866454
3  2.000001  2.2866454
4  2.298785 -1.3888607
5  2.931648 -0.2787888
6  3.000001 -0.2787888
7   ...

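Then I cut, split, and take weighted.mean, and the job is done. But the only way I have found to add these rows is a slow loop. For my real data, length(breaks) is between 500 and 20,000, dim(data)[1] is roughly 10,000 to 50,000, and I have to repeat this at least 2,000 times, so speed matters.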

2 answers:

Answer 0: (score: 1)

You can use stepfun to compute data.mod:

library(stats)

data <- read.table(
  header = TRUE,
  text =
 "time      value
  1.370958  1.3048697
  1.935657  2.2866454
  2.298785 -1.3888607
  2.931648 -0.2787888
  3.335916 -0.1333213
  3.442040  0.6359504
  4.953562 -0.2842529
  5.048222 -2.6564554
  7.066645 -2.4404669
  7.129359  1.3201133" )

breaks <- 2:7

f <- stepfun( x = data$time,
              y = c(data$value[1],data$value),
              right = FALSE )

t <- c( data$time , breaks )
v <- c( data$value, f(breaks) )
n <- order(t)

data.mod <- data.frame( time  = t[n],
                        value = v[n]  )

data.mod
#        time      value
# 1  1.370958  1.3048697
# 2  1.935657  2.2866454
# 3  2.000000  2.2866454
# 4  2.298785 -1.3888607
# 5  2.931648 -0.2787888
# 6  3.000000 -0.2787888
# 7  3.335916 -0.1333213
# 8  3.442040  0.6359504
# 9  4.000000  0.6359504
# 10 4.953562 -0.2842529
# 11 5.000000 -0.2842529
# 12 5.048222 -2.6564554
# 13 6.000000 -2.6564554
# 14 7.000000 -2.6564554
# 15 7.066645 -2.4404669
# 16 7.129359  1.3201133
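
With data.mod built, the cut / weighted-mean step described in the question can be sketched as follows. This is a minimal sketch rather than the answerer's code: it assumes each row's value holds until the next row's time (the convention data.mod implies), and the names tt, vv, seg and tw.mean are introduced here for illustration. Under this convention the per-bin figures need not match the number quoted in the question, which pairs the weights and values differently.

```r
# The question's data, typed in so the sketch is self-contained.
data <- data.frame(
  time  = c(1.370958, 1.935657, 2.298785, 2.931648, 3.335916,
            3.442040, 4.953562, 5.048222, 7.066645, 7.129359),
  value = c(1.3048697, 2.2866454, -1.3888607, -0.2787888, -0.1333213,
            0.6359504, -0.2842529, -2.6564554, -2.4404669, 1.3201133)
)
breaks <- 2:7

# Right-continuous step function, as in the answer above.
f <- stepfun(data$time, c(data$value[1], data$value), right = FALSE)

# Add one row per break time, then sort by time.
tt <- c(data$time, breaks)
vv <- c(data$value, f(breaks))
o  <- order(tt)
data.mod <- data.frame(time = tt[o], value = vv[o])

# One segment per pair of consecutive rows; the segment carries the
# value of its left endpoint (assumed convention).
n   <- nrow(data.mod)
seg <- data.frame(start = data.mod$time[-n],
                  dur   = diff(data.mod$time),
                  value = data.mod$value[-n])

# Bin segments by start time. right = FALSE because a segment that starts
# exactly at a break belongs to the bin beginning there, so the labels
# read [2,3), [3,4), ... rather than (2,3], (3,4], ...
seg$bin <- cut(seg$start, breaks, right = FALSE)
seg     <- seg[!is.na(seg$bin), ]   # drop segments outside the bins

# Time-weighted mean per bin: sum(value * duration) / sum(duration).
tw.mean <- tapply(seg$value * seg$dur, seg$bin, sum) /
           tapply(seg$dur, seg$bin, sum)
tw.mean
```

Because every break is itself a row of data.mod, no segment straddles a bin border, so grouping by the segment's start time is enough.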

Answer 1: (score: 1)

Using a combination of dplyr and tidyr, I would approach it as follows:

library(dplyr)
library(tidyr)
dat <- data  # the data frame from the question

dat %>%
  mutate(bin = gsub("\\(|\\]","",cut(time, floor(min(time)):ceiling(max(time))))) %>%
  separate(bin, c("start","end"), ",", remove=FALSE, convert=TRUE) %>%
  mutate(next.time = lead(time),
         next.value = lead(value)) %>%
  group_by(bin) %>%
  summarise(mn = (time[1]-start[1])*value[1] + 
              (time[n()]-time[1])*value[n()] + 
              (end[1]-next.time[n()])*next.value[n()]) %>%
  ungroup() %>%
  slice(2:(n()-1))

This gives:

Source: local data frame [4 x 2]

    bin         mn
  (chr)      (dbl)
1   2,3 -0.5466210
2   3,4  0.2937581
3   4,5 -0.1429546
4   5,6  2.4750141

Especially when speed and memory efficiency are a concern, you can also do this with the data.table package:

library(data.table)
dt <- copy(data)  # work on a copy of the question's data; setDT() converts by reference
setDT(dt)[, bin := gsub("\\(|\\]","",cut(time, floor(min(time)):ceiling(max(time))))
          ][, c("start","end") := tstrsplit(bin, ",", fixed=TRUE, type.convert = TRUE)
            ][, `:=` (next.time = shift(time, type="lead"), next.value = shift(value, type="lead"))
              ][, .(mn = (time[1]-start[1])*value[1] + 
                      (time[.N]-time[1])*value[.N] + 
                      (end[1]-next.time[.N])*next.value[.N]), 
                by = bin][2:(.N-1)][]

which gives the same result:

   bin         mn
1: 2,3 -0.5466210
2: 3,4  0.2937581
3: 4,5 -0.1429546
4: 5,6  2.4750141
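
Since the question stresses speed, here is a loop-free alternative that avoids both row insertion and grouping. It is a different technique from either answer: integrate the step function once, read the cumulative integral off at the breaks, and difference it. The sketch assumes value[i] holds on [time[i], time[i+1]) and that every break lies within range(data$time); cumint, Fb and tw.mean are names introduced here. Under this convention the figures need not match those printed above.

```r
# The question's data, typed in so the sketch is self-contained.
data <- data.frame(
  time  = c(1.370958, 1.935657, 2.298785, 2.931648, 3.335916,
            3.442040, 4.953562, 5.048222, 7.066645, 7.129359),
  value = c(1.3048697, 2.2866454, -1.3888607, -0.2787888, -0.1333213,
            0.6359504, -0.2842529, -2.6564554, -2.4404669, 1.3201133)
)
breaks <- 2:7

# Cumulative integral of the step function at each record time:
# value[i] is assumed to hold from time[i] until time[i + 1].
cumint <- c(0, cumsum(data$value[-nrow(data)] * diff(data$time)))

# The integral of a piecewise-constant function is piecewise linear, so
# linear interpolation evaluates it exactly at the break times.
Fb <- approx(data$time, cumint, xout = breaks)$y

# Mean over a bin = (integral over the bin) / (bin width).
tw.mean <- diff(Fb) / diff(breaks)

data.frame(bin = levels(cut(breaks, breaks)), mean = tw.mean)
```

Everything here is vectorised (cumsum, approx, diff), so it should suit repeated calls with many breaks, though this has not been benchmarked against the answers above.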