I have a problem where I need to find every repeated pattern in a sentence.
Example: 'camel horse game camel horse gym camel horse game' # This is the sanitized string; I will strip out anything other than words beforehand.
['camel horse game', 0, 3, 6] # pattern and the indexes where it is repeated
['camel horse', 0, 3, 6] # another pattern, even though it is a substring of the previous one
A suffix tree seems like a good solution, but I can't figure out how to implement one over WORDS rather than letters/characters.
Using the standard Duplicate Substrings solution
will not work, because it finds patterns with gaps/half-words -> 'camel horse', 'amel hor' .... 'am h'
which are of practically no use.
Thanks in advance.
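Answer 0: (score: 2)
You can build a suffix tree over any alphabet you like. Imagine an alphabet in which each distinct word in the paragraph is treated as a single letter. The suffix tree then lets you find repeated sequences of words in the paragraph without breaking the words into individual characters.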
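As a rough sketch of that idea, assuming the input is already sanitized and whitespace-separated: treat each word as one symbol and compare suffixes of the word sequence. For brevity this uses a naively sorted suffix array in place of a real suffix tree, which is a simpler stand-in and not what the answer literally proposes. The method name `repeated_word_patterns` and the `min_words` parameter are made up for illustration.

```ruby
# Find every word sequence that occurs at least twice, together with the
# word index of each occurrence. Words, not characters, are the symbols.
def repeated_word_patterns(sentence, min_words: 2)
  words = sentence.split
  n = words.length

  # Naive suffix array: start indexes sorted by the word sequence beginning there.
  starts = (0...n).sort_by { |i| words[i..-1] }

  patterns = Hash.new { |hash, key| hash[key] = [] }
  starts.each_cons(2) do |a, b|
    # Number of leading words shared by two adjacent suffixes.
    lcp = 0
    lcp += 1 while a + lcp < n && b + lcp < n && words[a + lcp] == words[b + lcp]

    # Every prefix of that shared run (of at least min_words words) repeats.
    (min_words..lcp).each do |len|
      key = words[a, len].join(' ')
      patterns[key] |= [a, b]
    end
  end
  patterns.transform_values(&:sort)
end

repeated_word_patterns('camel horse game camel horse gym camel horse game')
```

On the example string from the question this reports 'camel horse' at word indexes 0, 3 and 6, but 'camel horse game' only at 0 and 6, because the occurrence starting at index 3 ends in 'gym'. Occurrences are allowed to overlap, and sorting whole suffixes is quadratic in the worst case, so a proper suffix tree or suffix array construction would be the better choice for long texts.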
Answer 1: (score: 0)
I found this implementation in Ruby: http://rubyquiz.com/quiz153.html
It can be modified to find all repeated substrings. It includes a custom suffix tree implementation.
Answer 2: (score: 0)
# Relies on a global $string and a longest_common_prefix(s1, s2, max) helper,
# neither of which is shown in this post.
def all_repeated_substrings
  patterns = {}
  size = $string.length

  # Build every suffix of the string, then sort them so that suffixes
  # sharing a prefix end up next to each other.
  suffixes = Array.new(size)
  size.times do |i|
    suffixes[i] = $string.slice(i, size)
  end
  suffixes.sort!

  recurrence = ''
  at_least_size = 2 # the size to meet or exceed to be the new recurrence
  distance = nil
  neighbors_to_check = 1

  (1...size).each do |i|
    s1 = suffixes[i]
    neighbors_to_check.downto(1) do |neighbor|
      s2 = suffixes[i - neighbor]
      s1_size = s1.size
      s2_size = s2.size
      # The difference in suffix lengths is the distance between their start positions.
      distance = (s1_size - s2_size).abs
      next if distance < at_least_size
      recurrence = longest_common_prefix(s1, s2, distance)
      if recurrence.size > 1
        # A suffix of length k starts at index (size - k) in $string.
        if patterns[:"#{recurrence}"]
          patterns[:"#{recurrence}"] << (size - s2_size)
        else
          patterns[:"#{recurrence}"] = [(size - s2_size), (size - s1_size)]
        end
      end
      at_least_size = recurrence.size + 1
      if recurrence.size == distance
        neighbors_to_check = [neighbors_to_check, neighbor + 1].max
      else
        neighbors_to_check = neighbor
      end
    end
  end
  return patterns
end
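The method above calls `longest_common_prefix`, which the post does not include. A minimal sketch of what such a helper might look like, assuming it returns the leading characters shared by both strings and that the third argument caps the length of the result (this is a guess at its behavior, not the original code):

```ruby
# Assumed helper, not shown in the answer: the longest prefix shared by
# s1 and s2, limited to at most `max` characters when max is given.
def longest_common_prefix(s1, s2, max = nil)
  limit = [s1.size, s2.size].min
  limit = [limit, max].min if max
  i = 0
  i += 1 while i < limit && s1[i] == s2[i]
  s1[0, i]
end
```

If the cap works this way, passing `distance` keeps two occurrences of a pattern from overlapping.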
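This is an improvement on http://rubyquiz.com/quiz153.html that, I think, solves the problem above. There is one issue, though: it does not work for a cyclic pattern such as 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'. Anyone is welcome to improve the code above so that it handles cyclic patterns.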