Question

I want to find all runs in a data vector whose mean is below a given threshold. For example, for the dataset

d <- c(0.16, 0.24, 0.15, 0.17, 0.37, 0.14, 0.12, 0.08)

if I want to find all runs with a mean less than or equal to 0.20, the run over indices 1-6 would not qualify (mean 0.205), but 1-7 (mean 0.193) would, among others.
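As a quick check of the two runs mentioned above, their means can be computed directly. This is only an illustration; the name `threshold` is an assumption introduced here for the value 0.20.

```r
d <- c(0.16, 0.24, 0.15, 0.17, 0.37, 0.14, 0.12, 0.08)
threshold <- 0.20            # assumed name for the cutoff in the example

mean(d[1:6])                 # 0.205
mean(d[1:7])                 # about 0.193

mean(d[1:6]) <= threshold    # FALSE: run 1-6 does not qualify
mean(d[1:7]) <= threshold    # TRUE:  run 1-7 qualifies
```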

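For simplicity, I don't care about subsets of runs already known to have a mean below the threshold. That is, following the example, if I already know that 1-7 is below the threshold, I don't need to check 1-6. But I still need to check other runs that overlap run 1-7 without being subsets of it (e.g. 2-8).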

To approach this, I found that I could start with something similar to this, e.g.

hour <- c(1, 2, 3, 4, 5, 6, 7, 8)
value <- c(0.16, 0.24, 0.15, 0.17, 0.37, 0.14, 0.12, 0.08)
d <- data.frame(hour, value)

rng <- rev(seq_along(d$value))

data.table::setDT(d)[, paste0('MA', rng) := lapply(rng, function(x) 
    zoo::rollmeanr(value, x, fill = NA))][]

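and then search all of the generated columns for values below the threshold.

But this approach is inefficient for what I want to achieve (it looks at every subset of runs already identified as below the threshold) and does not scale to large datasets (around 500k entries, which would give me a 500k x 500k matrix).

Instead, it would be enough to record the indices of the runs below the threshold in a separate variable. That would at least avoid creating the 500k x 500k matrix. But I'm not sure how to check whether the output of rollmeanr() is below a certain value and, if so, get the corresponding indices.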
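The idea of recording indices instead of building the full matrix might be sketched as follows. This is a minimal base R illustration, not the asker's code: it replaces rollmeanr() with differences of cumulative sums so that it is self-contained, and the names `threshold`, `cs`, `hits` and `runs` are assumptions. It still visits every window width, so it does not skip sub-runs and remains quadratic in the length of the data.

```r
value <- c(0.16, 0.24, 0.15, 0.17, 0.37, 0.14, 0.12, 0.08)
threshold <- 0.20
n <- length(value)
cs <- c(0, cumsum(value))        # cs[i + 1] is the sum of value[1:i]

hits <- list()
for (k in n:1) {                 # window width, widest first
  end <- k:n                     # right-aligned windows, like rollmeanr()
  m <- (cs[end + 1] - cs[end - k + 1]) / k   # mean of value[(end - k + 1):end]
  idx <- end[m <= threshold]     # end indices of the windows that qualify
  if (length(idx) > 0) {
    hits[[length(hits) + 1]] <- data.frame(begin = idx - k + 1, end = idx)
  }
}
runs <- do.call(rbind, hits)     # one row per qualifying run; NULL if none
```

Only one vector of length at most n is held per width, so memory stays linear even though the running time does not.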

Answer 1

First, note that mean(x) <= threshold if and only if sum(x - threshold) <= 0.
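Spelled out for a run x of length n > 0 and a threshold t (symbols introduced here for illustration), the equivalence follows by multiplying through by n and moving the threshold inside the sum:

```latex
\[
\frac{1}{n}\sum_{i=1}^{n} x_i \le t
\;\iff\;
\sum_{i=1}^{n} x_i \le n\,t
\;\iff\;
\sum_{i=1}^{n} (x_i - t) \le 0 .
\]
```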

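Second, finding the runs of d with a non-positive sum is equivalent to finding the pairs of values in c(0, cumsum(d)) whose second value is less than or equal to the first, because the sum of a run is the difference between the cumulative sums at its two ends. Here this is applied to d - threshold rather than to d itself.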
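The correspondence between runs and pairs of cumulative sums can be illustrated with the question's data. This is a sketch under the assumption that `d` is the numeric vector and `threshold` is 0.20; `run_excess` is a hypothetical helper, not part of the answer.

```r
d <- c(0.16, 0.24, 0.15, 0.17, 0.37, 0.14, 0.12, 0.08)
threshold <- 0.20
s <- c(0, cumsum(d - threshold))   # s[i + 1] is the sum of (d - threshold)[1:i]

# Hypothetical helper: sum of (d - threshold) over the run d[b:e],
# obtained as the difference of two entries of s.
run_excess <- function(b, e) s[e + 1] - s[b]

run_excess(1, 7) <= 0   # TRUE:  s[8] <= s[1], so run 1-7 has mean <= threshold
run_excess(1, 6) <= 0   # FALSE: s[7] >  s[1], so run 1-6 does not
```

Note the shift by one: position b in s marks the start of a run, while position e + 1 marks its end.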

Hence:

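threshold <- 0.20  # value taken from the question's example
# d is the numeric vector from the question, not the data.frame
s <- c(0, cumsum(d - threshold))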

# potential start points of *maximal* runs:
B <- which(!duplicated(cummax(s)))
# potential end points:
E <- which(!duplicated(rev(cummin(rev(s))), fromLast = TRUE))

# end point associated with each start point
# (= for each point of B, we find the *last* point of E whose value of s
#  is less than or equal; the - 1 maps a position in s back to an index of d)
E2 <- E[findInterval(s[B], s[E])] - 1

# potential maximal runs:
df <- data.frame(begin = B, end = E2)

# now we just have to filter out rows with begin > end, and keep only the
# first begin for each end - for instance using dplyr:
library(dplyr)
df %>%
  filter(begin <= end) %>%
  group_by(end) %>%
  summarise(begin = min(begin))
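For convenience, the steps above might be wrapped into a single function that avoids the dplyr dependency. This is a sketch, not the answerer's code: `maximal_runs` is a hypothetical name, and the final filtering step is rewritten in base R on the assumption that B is increasing and E2 is non-decreasing, so the first begin seen for each end is the smallest.

```r
# Hypothetical helper: maximal runs of d whose mean is <= threshold.
maximal_runs <- function(d, threshold) {
  s <- c(0, cumsum(d - threshold))
  # candidate starts: positions where s reaches a new running maximum
  B <- which(!duplicated(cummax(s)))
  # candidate ends: last position of each distinct suffix minimum
  E <- which(!duplicated(rev(cummin(rev(s))), fromLast = TRUE))
  # last candidate end whose s value is <= the start's, mapped back to d
  E2 <- E[findInterval(s[B], s[E])] - 1
  keep <- B <= E2                    # drop empty or reversed runs
  begin <- B[keep]
  end <- E2[keep]
  first <- !duplicated(end)          # smallest begin for each end
  data.frame(begin = begin[first], end = end[first])
}

d <- c(0.16, 0.24, 0.15, 0.17, 0.37, 0.14, 0.12, 0.08)
maximal_runs(d, 0.20)   # one row: begin = 1, end = 8
```

With the question's data and a threshold of 0.20 the whole vector qualifies (its mean is about 0.179), so the only maximal run is 1-8.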

Find rolling means of any length below a threshold
