Question

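I've seen that suffix trees are recommended for finding repeated character patterns within a string, or even between two strings. What I'm trying to do is slightly different.

I have a list of strings in sequential order (by "in order" I mean the second string strictly follows the first, the third follows the second, and so on). I want to find repeated runs of 3 to 7 consecutive strings anywhere in the whole list. What data structure and algorithm would suit this problem?

The number of strings is at least 15K (assume 30K at most). Each string is roughly 3-35 characters long.

I'd like the strings in the first repeated subarray, along with how many times that subarray repeats in the whole list (I don't need the positions of the repeats).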

An example: ["a", "b", "c", "g", "h", "t", "i", "a", "b", "c", "z"]

Here I'd want the answer ["a", "b", "c"] when I pass 3 as the length of the repeat to look for. This length may vary from 3 to 7.

Answer 1

This finds the first repeater and how often it occurs:

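from collections import Counter

def repeater(strings, k):
    ctr = Counter()   # occurrences of each window of k consecutive strings
    first = None      # first window that is seen a second time
    for i in range(len(strings) - k + 1):
        seq = tuple(strings[i:i+k])   # tuples are hashable, lists are not
        if seq in ctr and first is None:
            first = seq
        ctr[seq] += 1
    # ctr.get(None) is None, so no repeat yields (None, None)
    return first, ctr.get(first)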

seq, ctr = repeater(["a", "b", "c", "g", "h", "t", "i", "a", "b","c", "z", "g", "h", "t"], 3)
if ctr:
    print(seq, 'occurred', ctr, 'times')

Prints:

('a', 'b', 'c') occurred 2 times

An alternative without Counter:

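def repeater(strings, k):
    seen = set()            # windows seen so far, each exactly once
    rep, ctr = None, None   # first repeated window and its count
    for i in range(len(strings) - k + 1):
        seq = tuple(strings[i:i+k])
        if rep is not None:
            ctr += seq == rep   # True counts as 1, False as 0
        elif seq in seen:
            rep, ctr = seq, 2   # second sighting of this window
        else:
            seen.add(seq)
    return rep, ctr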
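
Since the question asks for lengths from 3 to 7, one way to cover the whole range is to repeat the same sliding-window count once per length. This is only a sketch: the name first_repeats_by_length and its min_k/max_k parameters are made up for illustration and do not appear in the answer above. For 30K strings it builds about 150K short tuples in total.

```python
from collections import Counter

def first_repeats_by_length(strings, min_k=3, max_k=7):
    """Map each window length k to (first repeated window, its total count).

    Hypothetical helper: lengths with no repeated window are left out.
    """
    results = {}
    for k in range(min_k, max_k + 1):
        counts = Counter()
        first = None
        for i in range(len(strings) - k + 1):
            window = tuple(strings[i:i + k])
            # Remember only the earliest window that shows up a second time.
            if first is None and window in counts:
                first = window
            counts[window] += 1
        if first is not None:
            results[k] = (first, counts[first])
    return results

data = ["a", "b", "c", "g", "h", "t", "i", "a", "b", "c", "z", "g", "h", "t"]
print(first_repeats_by_length(data))
```

With the sample list this reports only length 3, because no run of 4 or more strings occurs twice.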

Finding repeated sets of strings in a larger set
