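Say there is a 2-column data frame with a time or distance column that increases sequentially and an observation column that may have NAs here and there. How can I efficiently use a sliding window function to get some statistic, say a mean, for the observations in a window of duration X (e.g. 5 seconds), slide the window over by Y seconds (e.g. 2.5 seconds), and repeat? The number of observations in a window is determined by the time column, so both the number of observations per window and the number of observations the window slides over may vary. The function should accept any window size up to the number of observations, and any step size.
Here is sample data (see "Edit:" for a larger sample set):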
set.seed(42)
dat <- data.frame(time = seq(1:20)+runif(20,0,1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:19,2)] <- NA_real_
head(dat)
time measure
1 1.914806 1.0222694
2 2.937075 0.3490641
3 3.286140 NA
4 4.830448 0.8112979
5 5.641746 0.8773504
6 6.519096 1.2174924
Desired output, for the specific case of a 5 second window, a 2.5 second step, the first window running from -2.5 to 2.5, and na.rm = FALSE:
[1] 1.0222694
[2] NA
[3] NA
[4] 1.0126639
[5] 0.9965048
[6] 0.9514456
[7] 1.0518228
[8] NA
[9] NA
[10] NA
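Explanation: in the desired output, the very first window looks for times between -2.5 and 2.5. One observation of measure falls in this window, and it is not an NA, so we get that observation: 1.0222694. The next window runs from 0 to 5, and there is an NA in the window, so we get NA. The same holds for the window from 2.5 to 7.5. The next window runs from 5 to 10. There are 5 observations in the window and none are NA, so we get the average of those 5 observations, i.e. mean(dat[dat$time > 5 & dat$time < 10, 'measure']).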
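To make the window definition concrete, here is a minimal sketch on a small hand-made data frame. The values in `toy` are hypothetical and are not the question's sample data; the open interval `time > s & time < s + win` follows the `mean(...)` expression in the explanation above. It shows that the number of observations per window varies and that a single NA turns the window mean into NA.

```r
# Hypothetical toy data, rounded for readability (not the sample from the question)
toy <- data.frame(time    = c(1.9, 2.9, 3.3, 4.8, 5.6, 6.5),
                  measure = c(1.0, 0.3, NA, 0.8, 0.9, 1.2))

win  <- 5     # window duration in seconds
step <- 2.5   # how far the window slides each time

# Window start times: -2.5, 0, 2.5, 5
starts <- seq(from = -2.5, to = max(toy$time), by = step)

# For each window, the row numbers whose time falls strictly inside it
in_window <- lapply(starts, function(s) which(toy$time > s & toy$time < s + win))

counts <- lengths(in_window)
means  <- vapply(in_window, function(idx) mean(toy$measure[idx]), numeric(1))

data.frame(start = starts, end = starts + win, n = counts, mean = means)
```

This brute-force version scans the whole time column once per window, so it only illustrates the definition; it is not meant as a fast solution.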
What I've tried: here is my attempt for the specific case of windows where the step size is 1/2 the window duration:
windo <- 5 # duration in seconds of window
# partition into groups depending on which window(s) an observation falls in
# When step size >= window/2 and < window, need two grouping vectors
leaf1 <- round(ceiling(dat$time/(windo/2))+0.5)
leaf2 <- round(ceiling(dat$time/(windo/2))-0.5)
l1 <- tapply(dat$measure, leaf1, mean)
l2 <- tapply(dat$measure, leaf2, mean)
as.vector(rbind(l2,l1))
Not flexible, not elegant, not efficient. If the step size isn't 1/2 the window size, the approach will not work as is.
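For comparison, one way to lift the "step must be half the window" restriction in base R is to look up each window's first and last row with `findInterval()`, which does a binary search on the sorted time column. This is only a sketch under stated assumptions: the function name `roll_by_time` and its arguments are made up for illustration, the time vector must already be sorted, and the window is treated as half-open, `s <= time < s + win`, which differs from the strictly open interval used in the explanation only when an observation lies exactly on a window start.

```r
# Hypothetical helper: apply `fun` over time-based windows of length `win`,
# sliding by `step`, starting at `start`. Assumes `time` is sorted ascending.
roll_by_time <- function(time, x, win, step, start = min(time), fun = mean, ...) {
  stopifnot(!is.unsorted(time), win > 0, step > 0)
  starts <- seq(from = start, to = max(time), by = step)
  # With left.open = TRUE, findInterval() counts the elements of `time`
  # that are strictly less than each query value.
  first <- findInterval(starts, time, left.open = TRUE) + 1L
  last  <- findInterval(starts + win, time, left.open = TRUE)
  vapply(seq_along(starts), function(i) {
    if (last[i] < first[i]) return(NA_real_)   # no observations in this window
    fun(x[first[i]:last[i]], ...)
  }, numeric(1))
}

# Hypothetical toy data
time    <- c(1.9, 2.9, 3.3, 4.8, 5.6, 6.5)
measure <- c(1.0, 0.3, NA, 0.8, 0.9, 1.2)

r_half  <- roll_by_time(time, measure, win = 5, step = 2.5, start = -2.5)
r_narm  <- roll_by_time(time, measure, win = 5, step = 2.5, start = -2.5, na.rm = TRUE)
r_other <- roll_by_time(time, measure, win = 3, step = 1, start = 0)  # step is not win / 2
```

Because the windows are located by index rather than by grouping, any combination of window length and step size goes through the same code path. The per-window call to `fun` is still an R-level loop, so this is a sketch of the indexing idea rather than a tuned implementation.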
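Any thoughts on a general solution to this kind of problem? Any solution is acceptable, and the faster the better, though I prefer solutions using base R, data.table, Rcpp, and/or parallel computation. In my real data set there are several million observations contained in a list of data frames (the largest data frame has ~400,000 observations).
Additional information: a larger sample set
Edit: As requested, here is a larger, more realistic example dataset with many more NAs and the minimum time span (~0.03). To be clear, though, the list of data frames contains small ones like the one above as well as ones like the following and larger: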
set.seed(42)
dat <- data.frame(time = seq(1:50000)+runif(50000, 0.025, 1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:50000,1000)] <- NA_real_
dat$measure[c(350:450,3000:3300, 20000:28100)] <- NA_real_
dat <- dat[-c(1000:2000, 30000:35000),]
# a list with a realistic number of observations:
dat <- lapply(1:300,function(x) dat)
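Answer 0 (score: 6)
Here is an attempt with Rcpp. The function assumes that the data is sorted by time. More testing would be advisable, and adjustments could be made.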
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector rollAverage(const NumericVector & times,
                          NumericVector & vals,
                          double start,
                          const double winlen,
                          const double winshift) {
  // number of windows needed to cover the data from `start` onwards
  int n = ceil((max(times) - start) / winshift);
  NumericVector winvals;
  NumericVector means(n);
  const int last = times.size() - 1;  // index of the final observation
  int ind1(0), ind2(0);
  for(int i=0; i < n; i++) {
    if (times[0] < (start+winlen)) {
      // advance ind1 towards the first observation of the window; the bounds
      // check comes first and && short-circuits, so times[ind1+1] is never
      // read past the end of the vector
      while((ind1 < last) &&
            (times[ind1] <= start) &&
            (times[ind1+1] <= (start+winlen))) {
        ind1++;
      }
      // advance ind2 to the last observation of the window
      while((ind2 < last) && (times[ind2+1] <= (start+winlen))) {
        ind2++;
      }
      if (times[ind1] >= start) {
        winvals = vals[seq(ind1, ind2)];
        means[i] = mean(winvals);
      } else {
        means[i] = NA_REAL;
      }
    } else {
      means[i] = NA_REAL;
    }
    start += winshift;
  }
  return means;
}
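The idea behind the function above is that, because the times are sorted, two indices can walk forward through the data once instead of rescanning it for every window. The following plain R sketch illustrates that scan; it is not a translation of the C++ code. The name `two_pointer_means` is hypothetical, and the window is assumed to be closed on both ends, `s <= time <= s + win`, which may treat observations lying exactly on a boundary differently from the Rcpp version.

```r
# Hypothetical illustration of a forward-only two-index scan over sorted times
two_pointer_means <- function(time, x, start, win, step) {
  n     <- length(time)
  n_win <- ceiling((max(time) - start) / step)
  out   <- rep(NA_real_, n_win)
  lo <- 1L   # first row of the current window
  hi <- 0L   # last row of the current window (0 means none found yet)
  for (i in seq_len(n_win)) {
    s <- start + (i - 1) * step
    # drop rows that lie before the window start
    while (lo <= n && time[lo] < s) lo <- lo + 1L
    # take in rows up to the window end
    while (hi < n && time[hi + 1L] <= s + win) hi <- hi + 1L
    if (hi >= lo) out[i] <- mean(x[lo:hi])   # empty windows stay NA
  }
  out
}

# Hypothetical toy data
time    <- c(1.9, 2.9, 3.3, 4.8, 5.6, 6.5)
measure <- c(1.0, 0.3, NA, 0.8, 0.9, 1.2)
tp <- two_pointer_means(time, measure, start = -2.5, win = 5, step = 2.5)
```

Neither index ever moves backwards, so locating all windows costs one pass over the data. In R the loop itself is slow, which is why the answer does this in compiled code.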
Testing it:
set.seed(42)
dat <- data.frame(time = seq(1:20)+runif(20,0,1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:19,2)] <- NA_real_
rollAverage(dat$time, dat$measure, -2.5, 5.0, 2.5)
#[1] 1.0222694 NA NA 1.0126639 0.9965048 0.9514456 1.0518228 NA NA NA
With your list of data.frames (using data.table):
set.seed(42)
dat <- data.frame(time = seq(1:50000)+runif(50000, 0.025, 1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:50000,1000)] <- NA_real_
dat$measure[c(350:450,3000:3300, 20000:28100)] <- NA_real_
dat <- dat[-c(1000:2000, 30000:35000),]
# a list with a realistic number of observations:
dat <- lapply(1:300,function(x) dat)
library(data.table)
dat <- lapply(dat, setDT)
for (ind in seq_along(dat)) dat[[ind]][, i := ind]
#possibly there is a way to avoid these copies?
dat <- rbindlist(dat)
system.time(res <- dat[, rollAverage(time, measure, -2.5, 5.0, 2.5), by=i])
#user system elapsed
#1.51 0.02 1.54
print(res)
# i V1
# 1: 1 1.0217126
# 2: 1 0.9334415
# 3: 1 0.9609050
# 4: 1 1.0123473
# 5: 1 0.9965922
# ---
#6000596: 300 1.1121296
#6000597: 300 0.9984581
#6000598: 300 1.0093060
#6000599: 300 NA
#6000600: 300 NA
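Answer 1 (score: 2)
Here is a function that gives the same result for your small data frame. It's not particularly quick: it takes a few seconds to run on one of the larger datasets from your second dat example.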
rolling_summary <- function(DF, time_col, fun, window_size, step_size, min_window=min(DF[, time_col])) {
# time_col is name of time column
# fun is function to apply to the subsetted data frames
# min_window is the start time of the earliest window
times <- DF[, time_col]
# window_starts is a vector of the windows' minimum times
window_starts <- seq(from=min_window, to=max(times), by=step_size)
# The i-th element of window_rows is a vector that tells us the row numbers of
# the data-frame rows that are present in window i
window_rows <- lapply(window_starts, function(x) { which(times>=x & times<x+window_size) })
window_summaries <- sapply(window_rows, function(w_r) fun(DF[w_r, ]))
data.frame(start_time=window_starts, end_time=window_starts+window_size, summary=window_summaries)
}
rolling_summary(DF=dat,
time_col="time",
fun=function(DF) mean(DF$measure),
window_size=5,
step_size=2.5,
min_window=-2.5)
Answer 2 (score: 2)
Here are some functions that give the same output on your first example:
partition <- function(x, window, step = 0){
a = x[x < step]
b = x[x >= step]
ia = rep(0, length(a))
ib = cut(b, seq(step, max(b) + window, by = window))
c(ia, ib)
}
roll <- function(df, window, step = 0, fun, ...){
tapply(df$measure, partition(df$time, window, step), fun, ...)
}
roll_steps <- function(df, window, steps, fun, ...){
X = lapply(steps, roll, df = df, window = window, fun = fun, ...)
names(X) = steps
X
}
The output on your first example:
> roll_steps(dat, 5, c(0, 2.5), mean)
$`0`
1 2 3 4 5
NA 1.0126639 0.9514456 NA NA
$`2.5`
0 1 2 3 4
1.0222694 NA 0.9965048 1.0518228 NA
You can also easily ignore missing values this way:
> roll_steps(dat, 5, c(0, 2.5), mean, na.rm = TRUE)
$`0`
1 2 3 4 5
0.7275438 1.0126639 0.9514456 0.9351326 NaN
$`2.5`
0 1 2 3 4
1.0222694 0.8138012 0.9965048 1.0518228 0.6122983
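The two offsets are returned as separate series, while the question asks for one vector ordered by window start. A small sketch of how they could be interleaved, using the values from the first output above (the one without na.rm); it assumes both offsets yield the same number of windows, as they do here.

```r
# Values copied from the roll_steps(dat, 5, c(0, 2.5), mean) output shown above
X <- list(
  `0`   = c(NA, 1.0126639, 0.9514456, NA, NA),
  `2.5` = c(1.0222694, NA, 0.9965048, 1.0518228, NA)
)

# rbind() stacks the two series as rows; as.vector() then reads the matrix
# column by column, which alternates between the two offsets.
# The `2.5` series goes first because its group 0 holds the earliest window.
interleaved <- as.vector(rbind(X[["2.5"]], X[["0"]]))
interleaved
```

The result has the same order as the desired output listed in the question.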
This can also be used on a list of data.frames:
> x = lapply(dat2, roll_steps, 5, c(0, 2.5), mean)
Answer 3 (score: 2)
Ok, how about this.
library(data.table)
dat <- data.table(dat)
setkey(dat, time)
# function to compute a given stat over a time window on a given data.table
window_summary <- function(start_tm, window_len, stat_fn, my_dt) {
pos_vec <- my_dt[, which(time>=start_tm & time<=start_tm+window_len)]
return(stat_fn(my_dt$measure[pos_vec]))
}
# a vector of window start times
start_vec <- seq(from=-2.5, to=dat$time[nrow(dat)], by=2.5)
# sapply'ing the function above over vector of start times
# (in this case, getting mean over 5 second windows)
result <- sapply(start_vec, window_summary,
window_len=5, stat_fn=mean, my_dt=dat)
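On my machine, it processes the first 20,000 rows of your large dataset in 13.06781 secs, and all rows in 51.58614 secs.
Answer 4 (score: 2)
Here's another attempt using a pure data.table approach and its between function.
I compared it using Rprof against the above answers (except @Roland's answer), and it seems to be the most optimized of them.
I haven't tested it for bugs yet, but if you like it, I'll expand the answer.
Using your dat from above: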
library(data.table)
Rollfunc <- function(dat, time, measure, wind = 5, slide = 2.5, FUN = mean, ...){
temp <- seq.int(-slide, max(dat$time), by = slide)
temp <- cbind(temp, temp + wind)
setDT(dat)[, apply(temp, 1, function(x) FUN(measure[between(time, x[1], x[2])], ...))]
}
Rollfunc(dat, time, measure, 5, 2.5)
## [1] 1.0222694 NA NA 1.0126639 0.9965048 0.9514456 1.0518228 NA NA
## [10] NA
You can also specify the function and its arguments, for example:
Rollfunc(dat, time, measure, 5, 2.5, max, na.rm = TRUE)
will work too.
Edit: I ran some benchmarks against @Roland's answer, and his method clearly wins (by far), so I'd go with the Rcpp approach.