Question

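I need to find the longest sequence in a string, with the caveat that the sequence must be repeated three or more times. So, for example, if my string is: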

fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld

then I would like the value "helloworld" to be returned.

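I know of a few ways of accomplishing this, but the problem I'm facing is that the actual string is very large, so I'm really looking for a method that can do it in a timely fashion.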

Answer 1

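This problem is a variant of the longest repeated substring problem, and there is an O(n)-time algorithm for solving it that uses suffix trees. The idea (as suggested by Wikipedia) is to construct a suffix tree (time O(n)), annotate every node in the tree with its number of leaf descendants (time O(n) using a DFS; each leaf corresponds to one suffix, and therefore to one occurrence), and then find the deepest node in the tree with at least three such descendants (time O(n) using a DFS). The overall algorithm takes time O(n).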

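That said, suffix trees are notoriously hard to construct, so you would probably want to find a Python library that implements suffix trees for you before attempting this implementation. A quick Google search turns up this library, though I'm not sure whether it is a good implementation.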
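If a suffix tree library is not at hand, the same idea can be tried out with a sorted list of suffixes instead. The sketch below is a swapped-in, simpler technique and not the O(n) suffix tree algorithm: it sorts the suffixes naively, then uses the fact that in sorted order the prefix shared by a window of k consecutive suffixes is the common prefix of the first and last suffix in that window. The names `longest_repeated` and `min_occ` are made up for illustration, and the naive sort copies every suffix, so this is only meant for moderately sized inputs.

```python
def longest_repeated(s, min_occ=3):
    """Longest substring of s occurring at least min_occ times (overlaps allowed)."""
    n = len(s)
    # Naive suffix array: start positions ordered by the suffix they begin.
    # Building the sort keys copies each suffix, so this uses O(n^2) memory.
    sa = sorted(range(n), key=lambda i: s[i:])

    def common_prefix_len(i, j):
        # Length of the longest common prefix of s[i:] and s[j:].
        k = 0
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        return k

    best_len, best_pos = 0, 0
    # Slide a window of min_occ consecutive suffixes over the sorted order.
    for idx in range(n - min_occ + 1):
        first, last = sa[idx], sa[idx + min_occ - 1]
        length = common_prefix_len(first, last)
        if length > best_len:
            best_len, best_pos = length, first
    return s[best_pos:best_pos + best_len]


text = ('fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1a'
        'helloworldgafgfa4564534321fadghelloworld')
print(longest_repeated(text))  # helloworld
```

For very large strings, a proper linear-time suffix tree or suffix array construction would still be needed; this version only shows the counting idea.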

Hope this helps!

Answer 2

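Use defaultdict to tally each substring beginning at each position in the input string. The OP wasn't clear whether overlapping matches should be included; this brute-force approach includes them.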

from collections import defaultdict

def getsubs(loc, s):
    substr = s[loc:]
    i = -1
    while(substr):
        yield substr
        substr = s[loc:i]
        i -= 1

def longestRepetitiveSubstring(r, minocc=3):
    occ = defaultdict(int)
    # tally all occurrences of all substrings
    for i in range(len(r)):
        for sub in getsubs(i,r):
            occ[sub] += 1

    # filter out all substrings with fewer than minocc occurrences
    occ_minocc = [k for k,v in occ.items() if v >= minocc]

    if occ_minocc:
        maxkey =  max(occ_minocc, key=len)
        return maxkey, occ[maxkey]
    else:
        raise ValueError("no repetitions of any substring of '%s' with %d or more occurrences" % (r,minocc))

Called on the example string from the question, this prints:

('helloworld', 3)
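To make the remark about overlapping matches concrete, here is a small sketch contrasting the two ways of counting. The helper name `count_overlapping` is hypothetical; `str.count` is the built-in, which only counts non-overlapping occurrences.

```python
def count_overlapping(s, sub):
    """Count occurrences of sub in s, allowing occurrences to overlap."""
    count = 0
    start = s.find(sub)
    while start != -1:
        count += 1
        # Resume one character after the previous match, not after its end.
        start = s.find(sub, start + 1)
    return count


print(count_overlapping('aaaa', 'aa'))  # 3 (positions 0, 1 and 2)
print('aaaa'.count('aa'))               # 2 (non-overlapping only)
```

The tallying approach above behaves like `count_overlapping`, so a string such as 'aaaa' reports 'aa' as occurring three times.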

Answer 3

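Let's start from the longest candidate length, count the frequency of every substring of that length, and stop as soon as the most frequent one appears 3 or more times.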

from collections import Counter
a='fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times=3
# try candidate lengths from longest to shortest
for n in range(len(a)//times, 0, -1):
    # every substring of length n
    substrings=[a[i:i+n] for i in range(len(a)-n+1)]
    freqs=Counter(substrings)
    if freqs.most_common(1)[0][1]>=times:
        seq=freqs.most_common(1)[0][0]
        break
print("sequence '%s' of length %s occurs %s or more times"%(seq,n,times))

Result:

>>> sequence 'helloworld' of length 10 occurs 3 or more times

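Edit: if you have the feeling that you are dealing with random input and the common substring should be short, you are better off (if you need the speed) starting with short substrings and stopping when you can't find any that appear at least 3 times: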

from collections import Counter
a='fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times=3
# try candidate lengths from shortest to longest
for n in range(1,len(a)//times+1):
    substrings=[a[i:i+n] for i in range(len(a)-n+1)]
    freqs=Counter(substrings)
    if freqs.most_common(1)[0][1]<times:
        # no substring of this length is frequent enough; keep the previous length
        n-=1
        break
    else:
        seq=freqs.most_common(1)[0][0]
print("sequence '%s' of length %s occurs %s or more times"%(seq,n,times))

Same result as above.
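Stopping at the first length that fails works because the property is monotonic: if some substring of length n occurs at least 3 times, so does its prefix of length n-1. That same property allows a binary search over the length instead of a linear scan. The following is only a sketch under that assumption; the function names `most_common_of_length` and `longest_repeated_bisect` are invented for illustration.

```python
from collections import Counter


def most_common_of_length(s, n):
    """Return (substring, count) for the most frequent substring of length n."""
    counts = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    return counts.most_common(1)[0]


def longest_repeated_bisect(s, times=3):
    """Longest substring occurring at least `times` times, found by bisection."""
    lo, hi = 0, len(s)  # length lo is known to work, lengths above hi do not
    best = ''
    while lo < hi:
        mid = (lo + hi + 1) // 2
        sub, count = most_common_of_length(s, mid)
        if count >= times:
            lo, best = mid, sub   # length mid works, try longer
        else:
            hi = mid - 1          # length mid fails, so every longer length fails
    return best


a = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
print(longest_repeated_bisect(a))  # helloworld
```

Each probe still scans the whole string, but only about log2(len(a)) lengths are probed rather than every length up to the answer.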

Answer 4

The first idea that came to mind is searching with progressively larger regular expressions:

import re

text = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
largest = ''
i = 1

while True:
    # i word characters that show up again at least twice later in the text
    m = re.search("(" + (r"\w" * i) + r").*\1.*\1", text)
    if not m:
        break
    largest = m.group(1)
    i += 1

print(largest)    # helloworld

The code ran successfully. The time complexity appears to be at least O(n^2).

Answer 5

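If you reverse the input string and then feed it to a regex like (.+)(?:.*\1){2}, it should give you the longest string repeated 3 times. (Reverse capture group 1 for the answer.)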

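Edit:
I have to retract this approach. It depends on the first match. Unless the current length is tested against the maximum length so far, in an iterative loop, regex won't work for this.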
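One possible reading of that iterative loop is to fix the length of the captured group and grow it until the pattern stops matching, keeping the last successful capture. This is a sketch of that reading, not the answerer's own code; `longest_by_regex` and its `times` parameter are hypothetical names. Note that a regex of this shape only finds non-overlapping occurrences, and that reversing the input is not needed here.

```python
import re


def longest_by_regex(text, times=3):
    """Longest substring with at least `times` non-overlapping occurrences."""
    largest = ''
    n = 1
    while True:
        # A group of exactly n characters, followed by times-1 later copies of it.
        pattern = r'(.{%d})(?:.*\1){%d}' % (n, times - 1)
        m = re.search(pattern, text, re.DOTALL)
        if not m:
            break
        largest = m.group(1)  # longest successful capture so far
        n += 1
    return largest


text = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
print(longest_by_regex(text))  # helloworld
```

Because of backtracking, each search can be expensive, so this is unlikely to suit the very large inputs mentioned in the question.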

Answer 6

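from collections import Counter

def Longest(string):
    # Note: this only looks at runs of a single repeated character.
    counts = Counter(string)

    # candidate runs: each character repeated 1 .. count+1 times
    b = []
    for i in set(string):
        for j in range(counts[i]+1):
            b.append(i* (j+1))

    # keep the candidates that actually occur in the string
    le = [i for i in b if i in string]

    # return every run of maximal length
    longest = len(max(le, key=len))
    return [s for s in le if len(s) == longest]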

Finding the longest repeated sequence in a string
