Question

How do you subset a time-series data.frame based on time and a threshold?

I have this data:

year <- seq(2000, 2009, 1)
v1 <- sample(1:10, 10, replace = TRUE)
df <- data.frame(year, v1)

It looks like this:

I would like to subset the data into groups of consecutive years whose summed v1 score exceeds 10.

In this sample data, the first subset should contain the observations for 2000 and 2001. The second subset should contain the observations for 2002, 2003 and 2004.

The real data has about 8 million observations covering 120 years.

Answer 1

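You can use Reduce to implement a custom cumulative sum that resets once the running total exceeds 10 and, at the same time, increments a counter that serves as the group variable: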

library(data.table)
transpose(Reduce(function(x, y) if (x[1] > 10) c(y, x[2] + 1) else c(x[1] + y, x[2]),
                 init = c(0, 1), df$v1, accumulate = TRUE))[[2]][-1]

# init holds two values: the first tracks the running sum and the second is the
# group variable. Once the running sum exceeds 10, the next value starts a new sum
# and the group variable increases by one. transpose() turns the list of
# (sum, group) pairs into two vectors; [[2]] picks the groups and [-1] drops init.

# [1] 1 1 2 2 2 3 3 3 3 4
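Once you have the group vector, the subsets themselves can be obtained with split(). The following is only a sketch: the v1 values are hypothetical, picked by hand so that they reproduce the grouping printed above (the question's data is random and unseeded), and vapply() stands in for data.table's transpose() so that it runs in base R. The names step, states, subsets, group_totals and complete are illustrative.

```r
# Hypothetical v1 values, chosen so the grouping matches 1 1 2 2 2 3 3 3 3 4
df <- data.frame(year = 2000:2009,
                 v1   = c(6, 7, 3, 4, 5, 2, 3, 2, 5, 4))

# Same reset-on-threshold logic as the Reduce call above
step <- function(x, y) if (x[1] > 10) c(y, x[2] + 1) else c(x[1] + y, x[2])
states <- Reduce(step, df$v1, init = c(0, 1), accumulate = TRUE)

# Drop the init state and keep the group counter of every (sum, group) pair
df$grp <- vapply(states[-1], function(s) s[2], numeric(1))

# One data.frame per group, named "1", "2", ...
subsets <- split(df, df$grp)

# The trailing group may end before its sum exceeds 10; keep only complete groups
group_totals <- tapply(df$v1, df$grp, sum)
complete <- subsets[names(group_totals)[group_totals > 10]]
```

With these values, subsets[["1"]] holds 2000 and 2001 and subsets[["2"]] holds 2002 to 2004, as the question asks. Whether the incomplete last group should be kept or dropped is not stated in the question, so the final filtering step is optional.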

Running it on a vector of 10 million observations takes about 20 seconds:

v <- sample(1:10, 10000000, replace = TRUE)
system.time(transpose(Reduce(function(x, y) if (x[1] > 10) c(y, x[2] + 1) else c(x[1] + y, x[2]),
                             init = c(0, 1), v, accumulate = TRUE))[[2]])

#   user  system elapsed 
# 19.509   0.552  20.081
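If that is too slow for the real data, one alternative worth benchmarking is a plain loop over a preallocated integer vector, which avoids building a list with one two-element vector per observation. This is a sketch, not part of the answer above and not timed here; group_by_threshold and its threshold argument are hypothetical names.

```r
# Assign a group id to each element; a new group starts after the running
# sum of the current group has exceeded `threshold`.
group_by_threshold <- function(v, threshold = 10) {
  grp <- integer(length(v))   # preallocated result
  running <- 0                # running sum within the current group
  g <- 1L                     # current group id
  for (i in seq_along(v)) {
    if (running > threshold) {  # previous group is complete: start a new one
      g <- g + 1L
      running <- 0
    }
    running <- running + v[i]
    grp[i] <- g
  }
  grp
}

# Usage on hypothetical values
group_by_threshold(c(6, 7, 3, 4, 5, 2, 3, 2, 5, 4))
```

The loop follows the same rule as the Reduce version, so the two should produce the same group ids for the same input; compare them with system.time() on your own data before settling on one.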
