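I want to take the mean over a 7-day rolling window that advances in 1-day increments, on data collected every 30 minutes.
I tried using data.table with a by condition, but without success. Any guidance would be greatly appreciated.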
# packages
library(data.table)
library(lubridate)
# Set the seed for reproducible sampling
set.seed(42)
# Create some Data
start = ymd_hms("2014-01-01 00:00:00")
end = ymd_hms("2014-12-31 23:59:59")
# Create data with 30 minute intervals.
dat <- data.table(timestamp = seq(start, end, by = "30 min"),
sample1 = sample(1:20, 17520, replace = TRUE))
# Create date variable for merging datasets.
dat[, date := as.Date(timestamp)]
# Create 7-day moving windows that advance in one-day increments.
dat2 <- data.table(start = seq(start, end, by = "1 day"),
end = seq(start + days(7), end + days(7), by = "1 day"))
# Create date variable for merging datasets.
dat2[, date := as.Date(start)]
# Merge datasets.
dat <- merge(dat, dat2, by="date")
# Tried
dat[, .(sample.mean = mean(sample1)), by = .(timestamp >= start & timestamp < end)]
# timestamp sample.mean
# 1: TRUE 10.46638
dat[, .(sample.mean = mean(sample1)), by = .(timestamp %in% c(start:end))]
# timestamp sample.mean
# 1: TRUE 10.40059
# 2: FALSE 10.46767
# Warning messages:
# 1: In start:end :
# numerical expression has 17520 elements: only the first used
# 2: In start:end :
# numerical expression has 17520 elements: only the first used
dat[, .(sample.mean = mean(sample1)), by = .(timestamp %between% c(start, end))]
# timestamp sample.mean
# 1: TRUE 19.00000
# 2: FALSE 10.46589
Answer 0 (score: 2)
I'm not 100% sure I understand your exact parameters, but here's the basic approach:
setkey(dat, date)
# pull the 7 previous days plus the current day (8 days in total; use 6L for exactly 7)
dat[ , dat[.(seq(.BY$date - 7L,
.BY$date, by = "day")),
#nomatch = 0L will exclude any requested dates outside the interval
mean(sample1), nomatch = 0L], by = date]
# date V1
# 1: 2014-01-01 12.31250
# 2: 2014-01-02 10.94792
# 3: 2014-01-03 11.27083
# 4: 2014-01-04 11.10417
# 5: 2014-01-05 10.79167
# ---
# 361: 2014-12-27 10.50260
# 362: 2014-12-28 10.52344
# 363: 2014-12-29 10.05990
# 364: 2014-12-30 10.03906
# 365: 2014-12-31 10.38542
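Some possible tweaks:
Change 7L to whatever window you want; use a positive offset if you want a forward-looking average.
If you want to go by timestamp instead, you'll have to adjust the 7L to match its units (seconds/minutes/hours/etc.).
The points at the extremes of the interval are technically incorrect, since their windows are shorter than requested; drop the nomatch argument and these points will return NA.
Use .(var = mean(sample1)) to name the output column var.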
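The tweaks above can be combined into one call. This is a minimal sketch, assuming that exactly 7 calendar days are wanted (hence an offset of 6L rather than 7L) and that incomplete windows should come back as NA. The names win and roll are illustrative and not part of the answer; the data is rebuilt along the lines of the question so the block runs on its own.

```r
library(data.table)

set.seed(42)
# Rebuild the question's data: one reading every 30 minutes for 2014
ts <- seq(as.POSIXct("2014-01-01 00:00:00", tz = "UTC"),
          as.POSIXct("2014-12-31 23:30:00", tz = "UTC"), by = "30 min")
dat <- data.table(timestamp = ts,
                  sample1 = sample(1:20, length(ts), replace = TRUE))
dat[, date := as.Date(timestamp)]
setkey(dat, date)

win <- 6L  # illustrative name: 6 previous days + the current day = 7 days

# nomatch is left at its default (NA), so a window that reaches back
# before the first date produces NA instead of a short-window mean
roll <- dat[, dat[.(seq(.BY$date - win, .BY$date, by = "day")),
                  .(var = mean(sample1))],
            by = date]
```

With these settings the first six dates should come back as NA, and the first complete window ends on 2014-01-07.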
Answer 1 (score: 2)
Here's one approach:
library(zoo)
daymeans = dat[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)]
dat[daymeans, rmean := i.rmean, on="date"]
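This assumes your data is already sorted by date; if not, use keyby=date instead of by=date. If you don't want to deal with the intermediate object, you can use a one-liner: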
# Michael Chirico's suggestion from the comments
# The inner chain must return a data.table (date, V1) so that i.V1 and on = "date" resolve
dat[dat[, mean(sample1), by=date][, V1 := rollmean(V1, 7, fill=NA)], rmean := i.V1, on = "date"]
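You may need to tweak the arguments to rollmean to fit your particular definition of the window. @eddi suggests that runmean from the caTools library is generally faster than zoo's rollmean, so it may be worth a look.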
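One such adjustment concerns where the window sits relative to each day. rollmean centres the window by default; its align argument moves it. The sketch below assumes that "the window" means the current day plus the six days before it, and uses a toy vector (daily, with made-up values) instead of the question's data.

```r
library(zoo)

# Toy daily means (illustrative values, not the question's data)
daily <- c(10, 12, 11, 9, 13, 10, 12, 14, 8, 11)

# Default alignment is centred: position 4 averages days 1 to 7
centred <- rollmean(daily, 7, fill = NA)

# Trailing window: position 7 averages days 1 to 7
trailing <- rollmean(daily, 7, fill = NA, align = "right")
```

The same seven-day mean shows up at position 4 in centred and at position 7 in trailing; only the date it is attached to changes.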
A rough benchmark on the OP's example data:
dat2 = copy(dat)
# Michael's answer
system.time({
setkey(dat, date)
dat[ , dat[.(seq(.BY$date - 7L,
.BY$date, by = "day")),
mean(sample1), nomatch = 0L], by = date]
})
#    user  system elapsed
#    0.33    0.00    0.35
# this answer
system.time({
daymeans = dat2[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)]
dat2[daymeans, rmean := i.rmean, on="date"]
})
#    user  system elapsed
#       0       0       0
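Why it's faster: here we compute 365 means of 48 numbers each, followed by a single rolling mean over a vector of length 365. That is computationally cheaper than doing 365 merges to find 48 * 7 numbers each and then taking the mean of the latter.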
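The shortcut of averaging daily means relies on every day holding the same number of readings (48 here). In that case the mean of seven daily means equals the mean of the 7 * 48 pooled values. A small base R check on simulated values, with illustrative names, might look like this:

```r
set.seed(1)
# 7 days x 48 half-hourly readings, simulated like the question's sample1
x   <- sample(1:20, 7 * 48, replace = TRUE)
day <- rep(1:7, each = 48)

pooled_mean   <- mean(x)                     # mean over all 336 readings
mean_of_means <- mean(tapply(x, day, mean))  # mean of the 7 daily means

# Compare with a tolerance to allow for floating-point rounding
all.equal(pooled_mean, mean_of_means)
```

If some days had missing readings, the two quantities would no longer agree in general, and the daily means would need to be weighted by their counts.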