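I've seen that suffix trees are recommended for finding repeated character patterns within a string, or even between two strings. What I'm trying to do is slightly different.
I have a list of strings in a fixed order (by ordered I mean the second string strictly follows the first, the third follows the second, and so on). I want to find repeated runs of 3 to 7 consecutive strings across the whole list. What data structure and algorithm would suit this problem?
There are at least 15K strings (assume at most 30K). Each string is roughly 3 to 35 characters long.
I'd like the strings in the first repeated subarray, along with how many times that subarray repeats in the whole list (I don't need the positions of the repeats).
An example: ["a", "b", "c", "g", "h", "t", "i", "a", "b", "c", "z"]
Here I want the answer ["a", "b", "c"] when I pass 3 as the required repeat length. This length can vary from 3 to 7.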
Answer 0 (score: 1)
This finds the first repeater and how often it occurs:
from collections import Counter

def repeater(strings, k):
    # Count every window of k consecutive strings and remember the first one seen twice.
    ctr = Counter()
    first = None
    for i in range(len(strings) - k + 1):
        seq = tuple(strings[i:i+k])  # tuples are hashable, list slices are not
        if seq in ctr and first is None:
            first = seq
        ctr[seq] += 1
    return first, ctr.get(first)  # (None, None) if nothing repeats

seq, count = repeater(["a", "b", "c", "g", "h", "t", "i", "a", "b", "c", "z", "g", "h", "t"], 3)
if seq is not None:
    print(seq, 'occurred', count, 'times')
Prints:
('a', 'b', 'c') occurred 2 times
An alternative without Counter:
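def repeater(strings, k):
    seen = set()
    rep, ctr = None, None
    for i in range(len(strings) - k + 1):
        seq = tuple(strings[i:i+k])
        if rep is not None:
            # Repeater already known: only count its further occurrences.
            ctr += seq == rep
        elif seq in seen:
            # Second sighting of seq makes it the first repeater.
            rep, ctr = seq, 2
        else:
            seen.add(seq)
    return rep, ctr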
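
Since the question mentions lengths from 3 to 7, here is a minimal sketch of running the same scan once per length. It assumes one pass per length is acceptable. The name `first_repeats` and its `min_k`/`max_k` parameters are made up for illustration; lengths with no repeated window are simply left out of the result.

```python
from collections import Counter

def first_repeats(strings, min_k=3, max_k=7):
    """Map each window length k to (first repeated window, its total count).

    Hypothetical helper: lengths with no repeated window are omitted.
    """
    results = {}
    for k in range(min_k, max_k + 1):
        ctr = Counter()
        first = None
        for i in range(len(strings) - k + 1):
            seq = tuple(strings[i:i + k])
            if first is None and seq in ctr:
                first = seq  # earliest window to show up a second time
            ctr[seq] += 1
        if first is not None:
            results[k] = (first, ctr[first])
    return results

data = ["a", "b", "c", "g", "h", "t", "i", "a", "b", "c", "z", "g", "h", "t"]
for k, (seq, count) in first_repeats(data).items():
    print(k, seq, 'occurred', count, 'times')
```

With 30K strings and five lengths this is about 150K windows in total, each hashed once as a tuple of at most seven strings.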