Question

I have a long time series and need to identify and flag repeated sequences of values in R. Suppose I have the following vector:

a <- c(1,2,3,4,88,443,756,2,453,6,21,98,1,2,3,4,65)

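Note that the sequence 1,2,3,4 appears at the beginning and again near the end. I want to identify and flag repeated runs of n numbers (where n is configurable) in a very long time series, which is why I need a robust way to do this.

Many thanks.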

Answer 1

You can use this function:

identRptSeq <- function(x, N = 4) {
    # Create groups to split input vector in
    splits <- ceiling(seq_along(x) / N)
    # Use data.table shift to create overlapping windows
    foo <- lapply(data.table::shift(x, 0:(N-1), type = "lead"), function(x) {
                  res <- split(x, splits)
                  res[lengths(res) == N]})
    foo <- na.omit(t(as.data.frame(foo)))
    # Find duplicated windows
    foo[duplicated(foo), ]
}

# OPs input
a <- c(1,2,3,4,88,443,756,2,453,6,21,98,1,2,3,4,65)

# Duplicated sequence when N = 4
identRptSeq(a, 4)
[1] 1 2 3 4

# Duplicated sequences when N = 3
identRptSeq(a, 3)
     [,1] [,2] [,3]
X5      1    2    3
X5.1    2    3    4

PS: keep in mind that it does not work when N = 1 (there are other ways to handle that case in R).
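For the N = 1 case, one of those other ways is base R's duplicated(), applied directly to the vector. A minimal sketch using the OP's input (the variable name rpt is only illustrative):

```r
# OPs input
a <- c(1,2,3,4,88,443,756,2,453,6,21,98,1,2,3,4,65)

# Values that occur more than once (the N = 1 case)
rpt <- unique(a[duplicated(a)])
rpt

# Positions of every occurrence of those values
which(a %in% rpt)
```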

Answer 2

If you have exactly repeated patterns, this is just O(n): hash each sequence and look for collisions.
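A rough sketch of that idea in base R, assuming a pasted string key is an acceptable stand-in for a hash (duplicated() does the collision lookup). labelRptSeq is a hypothetical helper name, not something from this answer, and building the keys costs O(n * N) rather than strictly O(n):

```r
# Hypothetical helper: flag every window of length N that occurs more than once
labelRptSeq <- function(x, N = 4) {
    stopifnot(N >= 1, length(x) >= N)
    # embed() gives one row per window, with columns in reversed order
    win <- embed(x, N)[, N:1, drop = FALSE]
    # Collapse each window into a single string key (stand-in for a hash)
    key <- apply(win, 1, paste, collapse = "_")
    # A window is repeated if the same key also occurs earlier or later
    rpt <- duplicated(key) | duplicated(key, fromLast = TRUE)
    data.frame(start = which(rpt), key = key[rpt])
}

# OPs input
a <- c(1,2,3,4,88,443,756,2,453,6,21,98,1,2,3,4,65)

# Start positions of the repeated windows for N = 4
labelRptSeq(a, 4)
```

The start column gives the index at which each repeated window begins, which can be used to flag positions in the original series. Unlike the function in Answer 1, this sketch also accepts N = 1. Keys are built from the printed values, so floating-point series may need rounding first.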

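If you have nearly repeated patterns (with similarity measured by Euclidean distance or correlation), then this is O(N^2), but the Matrix Profile algorithm is very fast [a].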

[a] http://www.cs.ucr.edu/~eamonn/MatrixProfile.html
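To illustrate what such a profile contains, here is a brute-force sketch in base R. It is not the fast algorithm from the linked page: it simply compares every pair of z-normalised windows. The helper name naiveProfile and the choice to exclude any neighbour closer than m positions are assumptions made for this example.

```r
# Hypothetical helper: brute-force nearest-neighbour profile, for illustration only
naiveProfile <- function(x, m = 4) {
    stopifnot(m >= 2, length(x) >= 2 * m)
    # One window of length m per row
    win <- embed(x, m)[, m:1, drop = FALSE]
    # z-normalise each window (a constant window would produce NaN here)
    z <- t(apply(win, 1, function(w) (w - mean(w)) / sd(w)))
    # Pairwise Euclidean distances between all windows
    d <- as.matrix(dist(z))
    # Ignore trivial matches: a window against itself or an overlapping neighbour
    d[abs(row(d) - col(d)) < m] <- Inf
    data.frame(start    = seq_len(nrow(d)),
               nearest  = unname(apply(d, 1, which.min)),
               distance = unname(apply(d, 1, min)))
}

# OPs input
a <- c(1,2,3,4,88,443,756,2,453,6,21,98,1,2,3,4,65)

p <- naiveProfile(a, 4)
# Windows with the smallest distance are the best candidates for a repeat
p[order(p$distance), ][1:2, ]
```

A distance of zero means two windows have the same shape after normalisation; small non-zero distances indicate near repeats.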

Identifying repeated sequences of numbers in a time series
