Question

Given a sequence Seq of length N whose characters are A, B, C and D:

Input: Seq={ACCBADBAACCBACADAAADC...DBACDBACD}
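Output: the string that appears most often

Is there a fastest algorithm for finding the string that appears most often? By that I mean:

Example: suppose AAA appears only once in Seq, then BAA also appears once, and so on... then it turns out that ACCBA appears 2 times in Seq, which is more than any other string, so the output is ACCBA. Please also state the worst-case complexity of the algorithm...

There may be many answers to this... brute force, for example, but that is slow... No exact code is needed... pseudocode should be enough... Please give me some hints or references... I want to learn...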

Answer 1

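I can see two interpretations of your question. I'll cover the one I think is most likely first.

In that case, you need to count the longest subsequences of a repeated character (that is, the maximal runs) to see which one occurs most often. In other words, the string AAABBBBBAAA would give you {2,AAA}, since the longest subsequence AAA occurs twice and BBBBB only once.

For that, you can use: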

dim seqcount['A'..'D',1..len(str)] = 0   # Array to hold counts.
lastch = str[0]                          # Last character processed.
count = 1                                # Count of last char processed.
maxseqcount = 0                          # Largest quantity to date.
maxseqchars = ""                         # Letters of that largest quantity.

# Process the end of a sequence.

def endseq (thisch,thiscount):
    # Increase quantity for letter/length combo.

    seqcount[thisch,thiscount] = seqcount[thisch,thiscount] + 1

    # Quantity same as current max, add letter to list (if not already there).

    if seqcount[thisch,thiscount] == maxseqcount:
        if not maxseqchars.contains (thisch):
            maxseqchars = maxseqchars + thisch

    # Quantity greater than current max, change max and use this letter.

    if seqcount[thisch,thiscount] > maxseqcount:
        maxseqcount = seqcount[thisch,thiscount]
        maxseqchars = thisch

def main:
    # Process every character (other than first) once.

    for pos = 1 to len(str) - 1:
        # Still in a sequence, add to length and restart loop.

        if str[pos] == lastch:
            count = count + 1
            continue

        # Letter change, process end of sequence.

        endseq (lastch, count)

        # Then store this new character and re-init count.

        lastch = str[pos]
        count = 1

    # Termination, we still have the last sequence to deal with.

    endseq (lastch, count)

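This gives you O(n) storage and O(n) time, since the set of possible characters is fixed (A through D) and each character of the string is processed only once.

At the end of processing, {maxseqcount,maxseqchars} holds what you want, although maxseqchars is not necessarily a single letter: your input string may be something like ABAB, in which case {2,A} and {2,B} are equally valid.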
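For anyone who wants something runnable, here is a minimal Python sketch of the same idea, not a literal port of the pseudocode above. The function name `most_frequent_runs` is made up for illustration, and unlike the pseudocode (which only remembers the winning letters) it returns the winning runs themselves.

```python
from collections import Counter
from itertools import groupby


def most_frequent_runs(seq):
    """Return (count, runs) for the runs of identical characters seen most often."""
    # groupby splits the string into maximal runs, e.g. "AAB" -> ("A", 2), ("B", 1).
    run_counts = Counter(
        (ch, sum(1 for _ in group)) for ch, group in groupby(seq)
    )
    if not run_counts:  # empty input: there are no runs at all
        return 0, []
    best = max(run_counts.values())
    # Rebuild each winning run as a string; several runs may tie.
    runs = sorted(ch * length for (ch, length), n in run_counts.items() if n == best)
    return best, runs


print(most_frequent_runs("AAABBBBBAAA"))  # (2, ['AAA'])
print(most_frequent_runs("ABAB"))         # (2, ['A', 'B'])
```

The string is still scanned only once; the `Counter` plays the role of the `seqcount` array, keyed by (letter, run length).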

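The second (though less likely) possibility is that you don't have to use the longest subsequences of a single character, in which case the most frequently occurring sequence will be {1,whatever single character occurs the most in the string}.

From your update, it looks like sequences are not required to consist of one repeated character, so this latter possibility is probably the case (any characters of any length, same or different, are allowed).

If so, you just process each character in O(n) to see which single character occurs most often. The answer is then simply the length-1 sequence made of that character.

For example, if your string is ACCBA, then {1,ACCBA} is not the solution. Instead, it is {2,A} or {2,C}.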
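The reason a single character always wins under this reading is that every occurrence of a longer string starts with an occurrence of its first character, so the longer string can never occur more often than that character does. A small Python sketch of the count follows (the name `most_frequent_chars` is just an illustrative choice):

```python
from collections import Counter


def most_frequent_chars(seq):
    """Return (count, chars) for the single characters that occur most often."""
    counts = Counter(seq)  # one pass over the string
    if not counts:         # empty input
        return 0, []
    best = max(counts.values())
    # Keep every character that reaches the maximum, since ties are possible.
    return best, sorted(ch for ch, n in counts.items() if n == best)


print(most_frequent_chars("ACCBA"))  # (2, ['A', 'C'])
```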

Answer 2

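If you mean that you want to find the longest run of identical characters, I believe this can be done in linear time with a relatively simple algorithm. You basically keep track of the character of the longest run so far, its count, the last character seen, and how many times in a row that character has been seen. Just scan through the string and update those values.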
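As a rough sketch of that scan in Python, assuming that on a tie the earliest longest run should be reported (the answer does not say), with the hypothetical name `longest_run`:

```python
def longest_run(seq):
    """Return (char, length) of the longest run of identical characters."""
    best_ch, best_len = "", 0  # longest run found so far
    last_ch, run_len = "", 0   # last character seen and its current streak
    for ch in seq:
        # Extend the streak if the character repeats, otherwise start a new one.
        run_len = run_len + 1 if ch == last_ch else 1
        last_ch = ch
        # Strict comparison keeps the earliest run when lengths tie.
        if run_len > best_len:
            best_ch, best_len = ch, run_len
    return best_ch, best_len


print(longest_run("AAABBBBBAAA"))  # ('B', 5)
```

Only the four tracked values are stored, so the extra space is constant.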

Efficiently determining the most common substring
