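I am trying to calculate a moving sum over some time series values. However, the data set is huge, and I am not sure what the fastest approach actually is.
Here is what I have tried:
1. Using data.table and filter
2. Using sapply, which can be parallelized with the foreach package. But I think there should be a neater way to do this.
Here is a code sample: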
set.seed(12345)
library(dplyr)
library(data.table)
# Generate random data
ts = seq(from = as.POSIXct(1447155253, origin = "1970-1-1"),
         to = as.POSIXct(1447265253, origin = "1970-1-1"),
         by = "min")
value = sample(1:10, length(ts), replace = TRUE)
sampleDF = data.frame(timestamp = ts, value = value )
sampleDF = as.data.table(sampleDF)
# Pre-manipulations
slidingwindow = 5*60 # 5 minutes window
end.ts = sampleDF$timestamp[length(sampleDF$timestamp)] - slidingwindow
end.i = which(sampleDF$timestamp >= end.ts)[1]
# Apply rolling sum
system.time(
  sapply(1:end.i,
         FUN = function(i) {
           from = sampleDF$timestamp[i]  # start of the window (inclusive)
           to = from + slidingwindow     # end of the window (exclusive)
           # Keep the rows inside the window and sum their values
           window.sum = dplyr::filter(sampleDF, timestamp >= from, timestamp < to) %>% .$value %>% sum
           return(window.sum)
         })
)
# user system elapsed
# 5.60 0.00 5.69
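Since the sample timestamps are evenly spaced at one-minute intervals, one possible alternative to the loop is to take differences of a cumulative sum. The following is only a minimal sketch under that even-spacing assumption, not a benchmarked solution; the names k, cs, starts, and rolling are illustrative and do not appear in the code above.

```r
# Assumption: one row per minute with no gaps, as in the generated sample data.
set.seed(12345)
ts <- seq(from = as.POSIXct(1447155253, origin = "1970-01-01"),
          to   = as.POSIXct(1447265253, origin = "1970-01-01"),
          by   = "min")
value <- sample(1:10, length(ts), replace = TRUE)

k <- 5               # rows per window: 5 minutes at one row per minute
n <- length(value)

# Prefix sums with a leading zero, so that cs[i + 1] == sum(value[1:i])
cs <- c(0, cumsum(value))

# Sum of value[i:(i + k - 1)] for every start index i that has a full window
starts <- 1:(n - k + 1)
rolling <- cs[starts + k] - cs[starts]
```

If the timestamps were irregular, a fixed row count per window would no longer be valid, and the window bounds would have to be looked up from the timestamps themselves (for example with findInterval) before taking the prefix-sum difference.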
Any suggestions would be greatly appreciated :-)