Rolling mean over a 7-day window, computed daily, for data sampled every 30 minutes

Time: 2016-04-14 17:38:17

Tags: r data.table

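I want to take the mean over a rolling 7-day window, advancing in 1-day increments, for data collected every 30 minutes. I tried using data.table with conditional statements in by, but without success. Any guidance would be greatly appreciated.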

# packages
library(data.table)
library(lubridate)

# Set the seed so the sampling is reproducible
set.seed(42)

# Create some Data
start = ymd_hms("2014-01-01 00:00:00")
end = ymd_hms("2014-12-31 23:59:59")

# Create data with 30 minute intervals.
dat <- data.table(timestamp = seq(start, end, by = "30 min"),
                  sample1 = sample(1:20, 17520, replace = TRUE))

# Create date variable for merging datasets.
dat[, date := as.Date(timestamp)]

# Create data for a 7-day moving window with one-day increments.
dat2 <- data.table(start = seq(start, end, by = "1 day"),
                  end = seq(start + days(7), end + days(7), by = "1 day"))

# Create date variable for merging datasets.
dat2[, date := as.Date(start)]

# Merge datasets.
dat <- merge(dat, dat2, by="date")

# Tried 
dat[, .(sample.mean = mean(sample1)), by = .(timestamp >= start & timestamp < end)]
#    timestamp sample.mean
# 1:      TRUE    10.46638

dat[, .(sample.mean = mean(sample1)), by = .(timestamp %in% c(start:end))]
#    timestamp sample.mean
# 1:      TRUE    10.40059
# 2:     FALSE    10.46767
#  Warning messages:
# 1: In start:end :
#  numerical expression has 17520 elements: only the first used
# 2: In start:end :
#   numerical expression has 17520 elements: only the first used

dat[, .(sample.mean = mean(sample1)), by = .(timestamp %between% c(start, end))]
#    timestamp sample.mean
# 1:      TRUE    19.00000
# 2:     FALSE    10.46589

2 Answers:

Answer 0 (score: 2)

I'm not 100% sure I understand your exact parameters, but here is the basic approach:

setkey(dat, date)

# pull the 7 previous days plus the current day
dat[ , dat[.(seq(.BY$date - 7L,
                 .BY$date, by = "day")),  
           #nomatch = 0L will exclude any requested dates outside the interval
           mean(sample1), nomatch = 0L], by = date]
#            date       V1
#   1: 2014-01-01 12.31250
#   2: 2014-01-02 10.94792
#   3: 2014-01-03 11.27083
#   4: 2014-01-04 11.10417
#   5: 2014-01-05 10.79167
#  ---                    
# 361: 2014-12-27 10.50260
# 362: 2014-12-28 10.52344
# 363: 2014-12-29 10.05990
# 364: 2014-12-30 10.03906
# 365: 2014-12-31 10.38542

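Some possible tweaks:

  • Change 7L to whatever window you want; use a positive offset if you want a forward-looking mean.

  • If you want to go by timestamp instead, you will have to adjust 7L to match its units (seconds/minutes/hours/etc.).

  • The extreme points of the interval are technically incorrect, because the window there is shorter than requested; drop the nomatch argument and these points will return NA.

  • Use .(var = mean(sample1)) to name the output column var.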
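A minimal sketch of two of the tweaks above, assuming a trailing window of exactly 7 calendar days is wanted (the current day plus the 6 before it, hence an offset of 6L) and that the output column should be called sample.mean. The 14-day data set is a hypothetical, shortened stand-in for the question's data.

```r
library(data.table)

set.seed(42)
# Hypothetical toy data: 30-minute samples for 14 days (672 rows)
ts <- seq(as.POSIXct("2014-01-01 00:00:00", tz = "UTC"),
          as.POSIXct("2014-01-14 23:30:00", tz = "UTC"), by = "30 min")
dat <- data.table(timestamp = ts,
                  sample1 = sample(1:20, length(ts), replace = TRUE))
dat[, date := as.Date(timestamp)]
setkey(dat, date)

# For each date, join back to dat on the 7 dates ending at that date
# and average every sample found; .(sample.mean = ...) names the column.
res <- dat[, dat[.(seq(.BY$date - 6L, .BY$date, by = "day")),
                 .(sample.mean = mean(sample1)), nomatch = 0L],
           by = date]
res
```

With nomatch = 0L the first six rows are averages over fewer than 7 days, which is the edge effect mentioned in the list.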

Answer 1 (score: 2)

Here is one way:

library(zoo)
daymeans = dat[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)]
dat[daymeans, rmean := i.rmean, on="date"]

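This assumes your data is already sorted by date; if not, use keyby=date instead of by=date. If you don't want to deal with the intermediate object, you can use a one-liner: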

# Michael Chirico's suggestion from the comments
# V1 is overwritten by reference so the inner expression stays a data.table to join on
dat[dat[, mean(sample1), by=date][, V1 := rollmean(V1, 7, fill=NA)], rmean := i.V1, on = "date"]

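You may need to tweak the arguments to rollmean to fit your particular definition of the window. @eddi suggests that runmean from the caTools library is typically faster than zoo's rollmean, so it may be worth a look.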
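To illustrate how the window definition changes the result, here is a small sketch on a hypothetical vector of ten daily means. It compares rollmean's default centered window with a trailing one (align = "right"), and with data.table's own frollmean, which is right-aligned by default. frollmean is not part of the original answer and assumes data.table 1.12.0 or later.

```r
library(data.table)
library(zoo)

daily <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)  # hypothetical daily means

# Centered window (rollmean's default): NA at the first and last 3 positions
centered <- rollmean(daily, 7, fill = NA)

# Trailing window: each value averages the current day and the 6 before it
trailing <- rollmean(daily, 7, fill = NA, align = "right")

# data.table equivalent of the trailing window
trailing_dt <- frollmean(daily, 7)

centered
trailing
trailing_dt
```

The centered version lines each mean up with the middle day of its window, while the trailing versions line it up with the last day, so pick the one that matches how the 7-day window is meant to be read.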

A crude benchmark on the OP's example data:

dat2 = copy(dat)

# Michael's answer
system.time({
setkey(dat, date)
dat[ , dat[.(seq(.BY$date - 7L,
                 .BY$date, by = "day")),  
           mean(sample1), nomatch = 0L], by = date]
})

   user  system elapsed 
   0.33    0.00    0.35

# this answer
system.time({
daymeans = dat2[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)]
dat2[daymeans, rmean := i.rmean, on="date"]
})

   user  system elapsed 
      0       0       0 

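Why it's faster: here we compute 365 means of 48 numbers each, followed by a rolling mean over a vector of length 365; that is computationally cheaper than doing 365 merges to find 48 * 7 numbers each time and then taking the mean of those.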
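The shortcut of averaging the daily means relies on every day contributing the same number of samples (48 here). The sketch below, on hypothetical toy data, checks that the mean of the daily means matches the pooled mean when the days are equal-sized, and sets up an uneven case where the two will generally differ.

```r
library(data.table)

set.seed(1)
# Hypothetical toy data: 7 days of 48 half-hourly samples each
toy <- data.table(day = rep(1:7, each = 48),
                  x = sample(1:20, 7 * 48, replace = TRUE))

pooled   <- toy[, mean(x)]                        # mean of all 336 values
of_means <- toy[, mean(x), by = day][, mean(V1)]  # mean of the 7 daily means
all.equal(pooled, of_means)

# Drop the first 10 readings of day 1 so the days are no longer equal-sized
uneven <- toy[-(1:10)]
pooled_uneven   <- uneven[, mean(x)]
of_means_uneven <- uneven[, mean(x), by = day][, mean(V1)]
c(pooled = pooled_uneven, of_means = of_means_uneven)
```

If some days have missing samples, weighting each daily mean by its sample count would be one way to recover the pooled value.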