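I am analyzing time-series signals. I set a threshold to separate the signal from the baseline noise. To characterize each signal train (duration, amplitude, maximum signal, ...), I built a function that aggregates all consecutive signal points into distinct "peaks". The function does what I want, but I wonder whether anyone can help me make it more efficient, e.g. by vectorizing it, since my goal is to run it on a data.table with more than 1M rows. Here is sample data together with the function: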
library(zoo)
library(ggplot2)

# Generate dummy data
x <- sin(seq(from = 0, to = 20, length.out = 200)) + rnorm(200, 0, 0.1)
x <- zoo(x)
plot(x)
# Label each point as signal (1) or noise (0)
y <- ifelse(x > 0.5, 1, 0)
# Function to label each peak
peak_labeler <- function(x) {
  tmp <- numeric(length(x))  # 0 marks baseline
  Peak <- 0                  # Running peak counter
  for (i in seq_along(x)) {
    if (x[i] == 1) {
      # x[i] belongs to a peak: start a new one at t0
      # or when the previous point was baseline
      if (i == 1 || x[i - 1] == 0) {
        Peak <- Peak + 1
      }
      tmp[i] <- Peak
    }
  }
  return(tmp)
}
# Label peaks
dummy <- data.frame(t = 1:200, x, y, tmp = peak_labeler(y))
# Show data
ggplot(dummy, aes(x = t, y = x)) +
  geom_point(aes(col = as.factor(tmp), group = 1))
Answer 0 (score: 0)
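Here is an approach using dplyr. The test in the cross_threshold line works by checking whether y is on the other side of 0.5 from the previous y. If it is, the two terms y - threshold and lag(y) - threshold have different signs, which gives TRUE; multiplied by 1, that becomes 1. If they are on the same side of 0.5, you get FALSE and therefore 0. The default = 0 part handles the first row, where lag(y) is undefined. We then take the cumulative sum of the upward crossings (up) to define the tmp groups.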
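To make the sign test concrete, here is a small sketch on a hand-made 0/1 vector. The name y_toy and its values are made up for illustration and are not taken from the question's data.

```r
library(dplyr)

threshold <- 0.5
y_toy <- c(0, 1, 1, 0, 0, 1, 0)  # hypothetical labels: 1 = signal, 0 = baseline

# TRUE wherever y_toy and its predecessor lie on opposite sides of the threshold;
# default = 0 supplies a baseline value before the first element
crossed <- sign(y_toy - threshold) != sign(lag(y_toy, default = 0) - threshold)

# Multiplying by 1 turns TRUE/FALSE into 1/0
cross_threshold <- 1 * crossed
print(cross_threshold)
# 0 1 0 1 0 1 1

# Running count of all crossings, rising and falling alike
print(cumsum(cross_threshold))
# 0 1 1 2 2 3 4
```

Note that counting every crossing increments on both the rising and the falling edge, so consecutive peaks would be numbered 1, 3, 5, ... That is presumably why the code counts only upward crossings when building tmp.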
library(dplyr)
library(ggplot2)

threshold <- 0.5
dummy <- data.frame(t = 1:200, x, y) %>%
  mutate(
    # Optional: only needed if we want to mark every crossing
    cross_threshold = 1 * (sign(y - threshold) != sign(lag(y, default = 0) - threshold)),
    # 1 where y rises above the threshold; default = 0 avoids an NA in the first row
    up = 1 * ((y > threshold) & (lag(y, default = 0) < threshold)),
    # Number each peak by the running count of upward crossings; baseline stays 0
    tmp = if_else(y > threshold, cumsum(up), 0)
  )
ggplot(dummy, aes(x = t, y = x)) +
  geom_point(aes(col = as.factor(tmp), group = 1)) +
  geom_point(data = filter(dummy, cross_threshold == 1), shape = 21, size = 5)
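The same idea can be written without dplyr, which may be convenient inside a data.table. The following is only a sketch: label_peaks is a hypothetical helper name, and it assumes the input is already a 0/1 vector like y in the question.

```r
# Hypothetical helper: label each run of 1s in a 0/1 vector as peak 1, 2, 3, ...
label_peaks <- function(y) {
  y <- as.numeric(y)               # work on a plain numeric vector
  prev <- c(0, head(y, -1))        # previous value, with baseline before the first point
  starts <- y == 1 & prev == 0     # TRUE exactly where a peak begins
  ifelse(y == 1, cumsum(starts), 0)  # peak number on signal points, 0 on baseline
}

y_toy <- c(0, 1, 1, 0, 0, 1, 0, 1, 1, 1)  # made-up example labels
print(label_peaks(y_toy))
# 0 1 1 0 0 2 0 3 3 3
```

Assuming a data.table named dt with a column y, this could be applied by reference with dt[, tmp := label_peaks(y)]. Because every step is a vectorized operation, it avoids the element-by-element loop of the original function.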