Finding a repeating set of strings in a larger set

Time: 2015-05-09 19:15:16

Tags: python string algorithm

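I have seen suffix trees recommended for finding repeated character patterns within a string, or even between two strings. What I am trying to do is slightly different.

I have a list of strings in a fixed order (by ordered I mean that the second string strictly follows the first, the third follows the second, and so on). I want to find repeating runs of 3 to 7 consecutive strings anywhere in the whole list. Which data structure and algorithm would suit this problem?

There are at least 15K strings (assume at most 30K). Each string is roughly 3-35 characters long.

I would like the strings in the first repeating subarray, plus how many times that subarray repeats in the whole list (I do not need the positions of the repeats).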

An example: ["a", "b", "c", "g", "h", "t", "i", "a", "b", "c", "z"]

Here I want the answer ["a", "b", "c"] when I ask for repeating runs of length 3. The length can vary from 3 to 7.

1 answer:

Answer 0: (score: 1)

This finds the first repeating sequence and how many times it occurs:

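from collections import Counter

def repeater(strings, k):
    ctr = Counter()   # occurrences of every window of k consecutive strings
    first = None      # first window that is seen a second time
    for i in range(len(strings) - k + 1):
        seq = tuple(strings[i:i+k])   # tuples are hashable, list slices are not
        if seq in ctr and first is None:
            first = seq
        ctr[seq] += 1
    # Returns (None, None) when no window repeats.
    return first, ctr.get(first)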

seq, ctr = repeater(["a", "b", "c", "g", "h", "t", "i", "a", "b", "c", "z", "g", "h", "t"], 3)
if ctr:
    print(seq, 'occurred', ctr, 'times')

Prints:

('a', 'b', 'c') occurred 2 times
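Since the question asks about lengths from 3 to 7, one way to cover the whole range is simply to call the function once per length. The following is a minimal sketch under that assumption; `first_repeats` is a hypothetical helper name that does not appear in the original answer, and `repeater` is repeated here only so the snippet runs on its own.

```python
from collections import Counter

def repeater(strings, k):
    # Same sliding-window counter as in the answer above.
    ctr = Counter()
    first = None
    for i in range(len(strings) - k + 1):
        seq = tuple(strings[i:i+k])
        if seq in ctr and first is None:
            first = seq
        ctr[seq] += 1
    return first, ctr.get(first)

def first_repeats(strings, lengths=range(3, 8)):
    # Hypothetical helper: map each window length to (first repeat, count).
    return {k: repeater(strings, k) for k in lengths}

data = ["a", "b", "c", "g", "h", "t", "i", "a", "b", "c", "z", "g", "h", "t"]
results = first_repeats(data)
for k, (seq, count) in results.items():
    if seq is not None:
        print(k, seq, count)
```

Each call scans the list once, so for a list of n strings the sweep looks at roughly 5n windows in total, which should be manageable for the 15K-30K strings mentioned in the question.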

An alternative without Counter:

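def repeater(strings, k):
    seen = set()            # windows seen before the first repeat is found
    rep, ctr = None, None   # first repeated window and its count
    for i in range(len(strings) - k + 1):
        seq = tuple(strings[i:i+k])
        if rep is not None:
            # First repeat already found: count its further occurrences.
            ctr += seq == rep
        elif seq in seen:
            # Second sighting of this window makes it the first repeat.
            rep, ctr = seq, 2
        else:
            seen.add(seq)
    return rep, ctr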
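As a quick sanity check, the two variants can be compared on a few small inputs. This is only a sketch: the names `repeater_counter` and `repeater_set` are made up here so that both versions can live in one file, and the extra inputs are illustrative rather than taken from the question.

```python
from collections import Counter

def repeater_counter(strings, k):
    # Counter-based variant: counts every window.
    ctr = Counter()
    first = None
    for i in range(len(strings) - k + 1):
        seq = tuple(strings[i:i+k])
        if seq in ctr and first is None:
            first = seq
        ctr[seq] += 1
    return first, ctr.get(first)

def repeater_set(strings, k):
    # Set-based variant: stops storing windows after the first repeat.
    seen = set()
    rep, ctr = None, None
    for i in range(len(strings) - k + 1):
        seq = tuple(strings[i:i+k])
        if rep is not None:
            ctr += seq == rep
        elif seq in seen:
            rep, ctr = seq, 2
        else:
            seen.add(seq)
    return rep, ctr

cases = [
    ["a", "b", "c", "g", "h", "t", "i", "a", "b", "c", "z", "g", "h", "t"],
    ["x", "y", "z"] * 3,        # the same run three times in a row
    ["a", "b", "c", "d"],       # no repeated window at all
]
agree = all(repeater_counter(c, 3) == repeater_set(c, 3) for c in cases)
print(agree)
```

The set-based version keeps only the windows seen before the first repeat, while the Counter version keeps a count for every distinct window in the list.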