Question

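I have a data source that repeats its values at some unknown interval. To complicate matters, the number of repetitions may not be an integer. Here is a contrived example: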

set.seed(1)
Values <- sample(1:10,10,replace=T)
Values
 [1]  3  4  6 10  3  9 10  7  7  1

CombinedValues <- c(Values,Values,Values[1:5])
 [1]  3  4  6 10  3  9 10  7  7  1  3  4  6 10  3  9 10  7  7  1  3  4  6 10  3

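My question is: given only the vector CombinedValues, what is the most efficient way to derive the longest repeated "pattern" (a.k.a. Values), considering that we do not know how long the repeated vector is? My expected output is the vector Values, or something that describes the indices at which the pattern repeats.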

Does an existing package already provide this functionality?

Clarifications

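The data source contains only the repeated sequence.
The pattern repeats at least twice.
We also know that the data starts with the pattern.
Patterns do not overlap, so the desired output is the longest non-overlapping pattern.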

Answer 1

set.seed(1)
Values <- sample(1:10,10,replace=T)
CombinedValues <- c(Values,Values,Values[1:5])

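max_seq <- function(x)
{
  max_seq_len <- 0
  # Try every candidate pattern length up to half the vector
  for (i in 1:floor(length(x)/2))
  {
    # Cut x into consecutive chunks of length i (the last chunk may be shorter)
    y <- split(x, ceiling(seq_along(x)/i))
    lengths <- sapply(y, length)
    # If all full-length chunks are identical, i is a valid pattern length
    if (length(unique(y[which(lengths == max(lengths))])) == 1)
    {
      max_seq_len <- i
    }
  }
  return(max_seq_len)
}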

max_seq(CombinedValues)

This returns 10, and CombinedValues[1:max_seq(CombinedValues)] returns your vector:

[1]  3  4  6 10  3  9 10  7  7  1

Hope this helps.
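The function above compares only the full-length chunks and keeps the largest chunk length that qualifies. As a complementary sketch (not part of the original answer), the period can also be found by comparing the vector with a shifted copy of itself, which checks the trailing partial repetition as well. The name `find_period` is hypothetical, the input is assumed to contain no NA values, and the data vector is copied from the output shown in the question. Note that this sketch returns the shortest period, which can differ from the longest qualifying chunk length when the pattern repeats four or more times.

```r
# Hypothetical helper: find the shortest period p such that
# x[i] == x[i + p] for every valid i. Assumes x has no NA values.
find_period <- function(x) {
  n <- length(x)
  for (p in seq_len(floor(n / 2))) {
    # Compare the vector with itself shifted by p positions;
    # this also covers a partial repetition at the end.
    if (all(x[seq_len(n - p)] == x[(p + 1):n])) {
      return(p)
    }
  }
  NA_integer_  # no pattern that repeats at least twice
}

# Values copied from the question's printed output
CombinedValues <- c(3, 4, 6, 10, 3, 9, 10, 7, 7, 1,
                    3, 4, 6, 10, 3, 9, 10, 7, 7, 1,
                    3, 4, 6, 10, 3)

p <- find_period(CombinedValues)                  # pattern length
pattern <- CombinedValues[seq_len(p)]             # the pattern itself
starts <- seq(1, length(CombinedValues), by = p)  # where each repetition begins
```

Here `starts` gives the index at which each repetition begins, which matches the alternative output the question asks for.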

Answer 2

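The solution I found uses rollapply from the zoo package. I assume that the pattern has at least a certain length and that the probability of a false positive is low.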

library(zoo)
which(rollapply(CombinedValues, 4, FUN=function(x) all(x == Values[1:4])))

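In this case, a window of only 4 values still leaves some chance of a false positive. In my real data, however, I can increase the 4 to 1000, which makes this work well as a quick solution.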
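For readers without zoo installed, the same sliding-window idea can be sketched in base R. This is an illustration rather than the answer's own code: the helper name `window_matches` and its parameter `k` (the window width) are hypothetical, the input is assumed to have at least `k` elements and no NA values, and the data vector is copied from the output shown in the question.

```r
# Hypothetical helper: start indices where the first k values of x recur.
# Assumes length(x) >= k and no NA values.
window_matches <- function(x, k) {
  probe <- x[seq_len(k)]           # the data is known to start with the pattern
  last_start <- length(x) - k + 1  # last position where a full window fits
  hits <- vapply(seq_len(last_start),
                 function(i) all(x[i:(i + k - 1)] == probe),
                 logical(1))
  which(hits)
}

# Values copied from the question's printed output
CombinedValues <- c(3, 4, 6, 10, 3, 9, 10, 7, 7, 1,
                    3, 4, 6, 10, 3, 9, 10, 7, 7, 1,
                    3, 4, 6, 10, 3)

starts <- window_matches(CombinedValues, 4)
period <- diff(starts)[1]  # gap between the first two matches
```

The gap between the first two match positions is a candidate pattern length. As in the answer, a wider window lowers the chance that a match is accidental.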

Finding a repeated sequence of varying numbers
