Find all repeated patterns in a paragraph

Time: 2016-10-23 18:56:39

Tags: algorithm language-agnostic pattern-matching suffix-tree substring

I have a problem at hand where I need to find all the repeated patterns present in a sentence.

Example: 'camel horse game camel horse gym camel horse game' # this is the sanitized string; I will strip out anything other than words beforehand

['camel horse game', 0, 6] # pattern and the word indexes where it is repeated
['camel horse', 0, 3, 6] # another pattern; it may be a sub-pattern of the previous one

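A suffix tree looks like a good solution, but I can't figure out how to implement it for WORDS instead of letters/characters.

A standard duplicate-substrings solution won't work, because it also finds patterns with chopped or half words, e.g. 'camel horse', 'amel hor' .... 'am h', which are of no practical use.

Thanks in advance.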

3 answers:

Answer 0 (score: 2)

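You can build a suffix tree over any alphabet you like. Imagine an alphabet in which every distinct word in the paragraph is treated as a single letter. The suffix tree will then let you find repeated sequences of words in the paragraph without ever splitting a word into individual characters.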
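As a rough sketch of that idea (not a full suffix tree): the Ruby below splits the text into words and sorts the word-level suffixes, i.e. a suffix array whose "letters" are whole words, which is a simpler stand-in for the tree. The method name repeated_word_patterns and the min_words parameter are made up for illustration, and overlapping occurrences are counted.

```ruby
# Hypothetical helper: maps every word sequence of at least min_words words
# that occurs two or more times to the sorted word indexes where it starts.
def repeated_word_patterns(text, min_words = 2)
  words = text.split
  # Word-level suffix array: start positions ordered by the word sequence
  # beginning there, so suffixes sharing a prefix end up next to each other.
  suffixes = (0...words.size).sort_by { |i| words[i..-1] }
  patterns = Hash.new { |hash, key| hash[key] = [] }

  suffixes.each_cons(2) do |a, b|
    # Number of leading words the two neighbouring suffixes have in common.
    shared = 0
    while a + shared < words.size && b + shared < words.size &&
          words[a + shared] == words[b + shared]
      shared += 1
    end
    # Every prefix of the shared run (from min_words up) is a repeated pattern.
    (min_words..shared).each do |length|
      patterns[words[a, length].join(' ')].push(a, b)
    end
  end

  # Drop duplicate positions collected from overlapping neighbour pairs.
  patterns.each_with_object({}) do |(pattern, starts), result|
    result[pattern] = starts.uniq.sort
  end
end

text = 'camel horse game camel horse gym camel horse game'
repeated_word_patterns(text).each { |pattern, starts| p [pattern, *starts] }
```

For the example string this reports 'camel horse' at word indexes 0, 3 and 6 and 'camel horse game' at 0 and 6, with no half-word matches, because the comparison never looks inside a word. Sorting whole suffixes like this is fine for a paragraph but would get slow on very long texts, which is where a real suffix tree pays off.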

Answer 1 (score: 0)

I found this implementation in Ruby: http://rubyquiz.com/quiz153.html

It can be modified to find all repeated substrings. It uses a custom suffix tree implementation.

Answer 2 (score: 0)

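# Returns a hash mapping each repeated substring (as a symbol) to the
# character offsets where it was found.
# longest_common_prefix(s1, s2, max) must be defined separately.
def all_repeated_substrings(string)
  patterns = {}
  size = string.length

  # Build every suffix of the string; suffixes[i] starts at offset i.
  suffixes = Array.new(size) { |i| string[i..-1] }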

  suffixes.sort!

  recurrence = ''
  at_least_size = 2 # the size to meet or exceed to be the new recurrence
  distance = nil
  neighbors_to_check = 1

  (1...size).each do |i|
    s1 = suffixes[i]
    neighbors_to_check.downto(1) do |neighbor|
      s2 = suffixes[i - neighbor]
      s1_size = s1.size
      s2_size = s2.size
      distance = (s1_size - s2_size).abs
      next if distance < at_least_size
      recurrence = longest_common_prefix(s1, s2, distance)
      if recurrence.size > 1
        if patterns[:"#{recurrence}"]
          patterns[:"#{recurrence}"] << (size - s2_size)
        else
          patterns[:"#{recurrence}"] = [(size - s2_size), (size - s1_size)]
        end
      end
      at_least_size = recurrence.size + 1
      if recurrence.size == distance
        neighbors_to_check = [neighbors_to_check, neighbor + 1].max
      else
        neighbors_to_check = neighbor
      end
    end
  end
  return patterns
end
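The method above calls longest_common_prefix, which this answer does not show. A minimal sketch of what it might look like, assuming it returns the prefix shared by the two strings and that the third argument caps the prefix length (the caller passes the distance between the two starting offsets, presumably so that the two occurrences cannot overlap):

```ruby
# Assumed helper (a guess at the missing piece, not taken from the answer):
# returns the longest prefix shared by s1 and s2, at most max_length long.
def longest_common_prefix(s1, s2, max_length = nil)
  limit = [s1.size, s2.size].min
  limit = [limit, max_length].min if max_length
  length = 0
  # Advance while the characters at the same offset still match.
  length += 1 while length < limit && s1[length] == s2[length]
  s1[0, length]
end
```

If the cap really is there to prevent overlap, that would also explain why a string made of one repeated character gives poor results: every pair of neighbouring suffixes is only one character apart.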

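This is adapted from http://rubyquiz.com/quiz153.html to solve the problem above. There is one catch, I think: it does not work for a cyclic pattern such as 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'. Anyone is welcome to improve the code above so that it handles cyclic patterns.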