Question

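I have a big string, say "aaaaaaaaaaabbbbbbbbbcccccccccccdddddddddddd" (but possibly much longer), and a collection of many small strings. I want to count how many times each small string occurs in the big string (overlapping occurrences are fine). I only care about speed. KMP seems good, but it handles one pattern at a time; Rabin-Karp handles multiple patterns but looks slow.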

Answer 1

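The problem with most string-searching algorithms is that they take at least O(k) time to return k matches. So if you have a string of, say, 1 million "a"s and 1 million queries for the small string "a", it takes around a million million (10^12) iterations to count all the matches!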
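To make that cost concrete, here is a minimal baseline sketch that counts overlapping occurrences by enumerating them one by one. The function name `count_overlapping` is only for illustration; the loop body runs once per match, which is exactly the O(k) behaviour described above.

```python
def count_overlapping(haystack, needle):
    """Count occurrences of needle in haystack, allowing overlaps."""
    if not needle:
        return 0
    count = 0
    pos = haystack.find(needle)
    while pos != -1:
        count += 1
        # Restart one character later so overlapping matches are found too.
        pos = haystack.find(needle, pos + 1)
    return count


print(count_overlapping("aaaa", "aa"))
```

This is fine for a handful of queries, but every query rescans the big string and pays for every match it finds.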

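An alternative, linear-time approach is:

Construct a suffix tree of the big string: O(n), where n is len(big string)
Precompute the number of suffixes below each node in the suffix tree: O(n)
For each small string, find the node reached by following the small string down from the root of the suffix tree: O(m), where m is len(small string)
Add the number of suffixes below that node to the total count. (Each such suffix corresponds to a different match of the small string in the big string.)

This takes O(n + p) time, where n is the length of the big string and p is the total length of all the small strings.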
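The key point is that a count query never enumerates the matches. As a rough illustration of that idea with a simpler structure, here is a sketch that uses a sorted suffix array instead of a suffix tree. This is a swapped-in technique, not the method above: the naive construction here is far from O(n), each query costs O(m log n), and the names `build_suffix_array` and `count_with_suffix_array` are made up for this example.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_with_suffix_array(text, sa, needle):
    """Count occurrences of a non-empty needle using two binary searches."""
    m = len(needle)
    if m == 0:
        return 0

    def first_index(strictly_greater):
        # First position in sa whose length-m prefix is >= needle
        # (or > needle when strictly_greater is True).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < needle or (strictly_greater and prefix == needle):
                lo = mid + 1
            else:
                hi = mid
        return lo

    # All suffixes starting with needle form one contiguous block in sa.
    return first_index(True) - first_index(False)


text = "banana"
sa = build_suffix_array(text)
for needle in ["a", "ana", "nan", "x"]:
    print(needle, count_with_suffix_array(text, sa, needle))
```

The size of the block of matching suffixes is the answer, so a needle with a million matches costs no more to count than one with a single match.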

Example code

As requested, here is some small(ish) example code in Python that uses this approach:

from collections import defaultdict

class SuffixTree:
    def __init__(self):
        """Returns an empty suffix tree"""
        self.T=''
        self.E={}
        self.nodes=[-1] # 0th node is empty string

    def add(self,s):
        """Adds the input string to the suffix tree.

        This inserts all substrings into the tree.
        End the string with a unique character if you want a leaf-node for every suffix.

        Produces an edge graph keyed by (node,character) that gives (first,last,end)
        This means that the edge has characters from T[first:last+1] and goes to node end."""
        origin,first,last = 0,len(self.T),len(self.T)-1
        self.T+=s
        nc = len(self.nodes)
        self.nodes += [-1]*(2*len(s))
        T=self.T
        E=self.E
        nodes=self.nodes

        Lm1=len(T)-1
        for last_char_index in range(first,len(T)):  # range works on Python 2 and 3
            c=T[last_char_index]
            last_parent_node = -1                    
            while 1:
                parent_node = origin
                if first>last:
                    if (origin,c) in E:
                        break             
                else:
                    key = origin,T[first]
                    edge_first, edge_last, edge_end = E[key]
                    span = last - first
                    A = edge_first+span
                    m = T[A+1]
                    if m==c:
                        break
                    E[key] = (edge_first, A, nc)
                    nodes[nc] = origin
                    E[nc,m] = (A+1,edge_last,edge_end)
                    parent_node = nc
                    nc+=1  
                E[parent_node,c] = (last_char_index, Lm1, nc)
                nc+=1  
                if last_parent_node>0:
                    nodes[last_parent_node] = parent_node
                last_parent_node = parent_node
                if origin==0:
                    first+=1
                else:
                    origin = nodes[origin]

                if first <= last:
                    edge_first,edge_last,edge_end=E[origin,T[first]]
                    span = edge_last-edge_first
                    while span <= last - first:
                        first+=span+1
                        origin = edge_end
                        if first <= last:
                            edge_first,edge_last,edge_end = E[origin,T[first]]
                            span = edge_last - edge_first

            if last_parent_node>0:
                nodes[last_parent_node] = parent_node
            last+=1
            if first <= last:
                    edge_first,edge_last,edge_end=E[origin,T[first]]
                    span = edge_last-edge_first
                    while span <= last - first:
                        first+=span+1
                        origin = edge_end
                        if first <= last:
                            edge_first,edge_last,edge_end = E[origin,T[first]]
                            span = edge_last - edge_first
        return self


    def make_choices(self):
        """Construct a sorted list for each node of the possible continuing characters"""
        choices = [list() for n in range(len(self.nodes))] # Contains list of choices for each node
        for (origin,c),edge in self.E.items():
            choices[origin].append(c)
        choices=[sorted(s) for s in choices] # should not have any repeats by construction
        self.choices=choices
        return choices


    def count_suffixes(self,term):
        """Recurses through the tree finding how many suffixes are based at each node.
        Strings assumed to use term as the terminating character"""
        C = self.suffix_counts = [0]*len(self.nodes)
        choices = self.make_choices()
        def f(node=0):
            t=0
            X=choices[node]
            if len(X)==0:
                t+=1 # this node is a leaf node
            else:
                for c in X:
                    if c==term:
                        t+=1
                        continue
                    first,last,end = self.E[node,c]
                    t+=f(end)
            C[node]=t
            return t
        return f()

    def count_matches(self,needle):
        """Return the count of matches for this needle in the suffix tree"""
        i=0
        node=0
        E=self.E
        T=self.T
        while i<len(needle):
            c=needle[i]
            key=node,c
            if key not in E:
                return 0
            first,last,node = E[key]
            while i<len(needle) and first<=last:
                if needle[i]!=T[first]:
                    return 0
                i+=1
                first+=1
        return self.suffix_counts[node]


big="aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd"
small_strings=["a","ab","abc"]
s=SuffixTree()
term=chr(0)
s.add(big+term)
s.count_suffixes(term)
for needle in small_strings:
    x=s.count_matches(needle)
    print('%s has %d matches' % (needle, x))

This prints:

a has 11 matches 
ab has 1 matches 
abc has 0 matches

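In practice, however, I would recommend simply using a pre-existing Aho-Corasick implementation, as I expect that to be much faster in your particular case.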

Answer 2

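Matching against a large collection of strings sounds to me like http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm. It does find matches one at a time, so Peter de Rivaz's idea might be better if there are a huge number of matches. On the other hand, Aho-Corasick does not need to keep the big string in memory (you can just stream it through), and it is very practical to implement and tune: the Wikipedia link notes that the original fgrep used it.

Thinking about it, you can work around the mega-match problem. Aho-Corasick can be viewed as building a deterministic finite state machine that recognizes exactly the strings it is searching for. The state of the machine corresponds to the last N characters seen. If you want to match two strings and one is a suffix of the other, you need to make sure that when you are in the state that says you have just matched the longer string, you also recognize that you have just matched the shorter one. If you deliberately choose not to do this, the counts you accumulate for the shorter string will be wrong, but you know exactly how far off they are: too low by the number of times the longer string was seen. So if you modify Aho-Corasick to recognize and count only the longest match at each position, the cost of matching stays linear in the number of characters in the string being searched, and you can fix up the counts at the end by going through the long strings and adding their counts to those of the shorter strings that are their suffixes. This fix-up takes time at most linear in the total size of the strings being searched for.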
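Here is a rough sketch of that modification in pure Python, assuming all patterns are non-empty. The function name `count_all` and the variable names are invented for this example, and a tuned library implementation would look different. During the scan only the deepest state at each position is counted; afterwards the counts are pushed along the failure (suffix) links from longer states to shorter ones.

```python
from collections import deque


def count_all(text, patterns):
    """Return overlapping occurrence counts for each non-empty pattern."""
    goto = [{}]        # goto[state][char] -> next state; state 0 is the root
    fail = [0]         # failure link: longest proper suffix that is a trie node
    end_state = []     # end_state[i] is the state spelling patterns[i]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        end_state.append(s)

    # Breadth-first pass to set failure links; remember the visiting order.
    order = []
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        order.append(s)
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            queue.append(t)

    # Scan: count only the deepest state reached at each text position.
    visits = [0] * len(goto)
    s = 0
    for ch in text:
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        visits[s] += 1

    # Fix-up: deeper states first, add each count to its suffix state.
    for s in reversed(order):
        visits[fail[s]] += visits[s]
    return [visits[e] for e in end_state]


print(count_all("aaaa", ["a", "aa", "aaa"]))
```

The scan does a constant amount of counting work per character regardless of how many patterns end there, and the fix-up loop touches each trie state once.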

Answer 3

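Building on another answer (and hopefully convincing you that it is the best type of answer): have a look at http://en.wikipedia.org/wiki/Suffix_tree and the references listed there to learn about suffix trees if you really want the fastest solution to your problem. It also lets you get the number of matches without iterating over all of them, and the running time and memory requirements you get are the best possible for any substring-matching or match-counting algorithm. Once you understand how a suffix tree works and how to build and use it, the only extra tweak you need is to store, at each internal node of the tree, the number of different string starting positions it represents. This is a minor modification that you can do easily and efficiently (in linear time, as already claimed) by recursively taking the counts from the child nodes and adding them up to get the count for the current node. These counts then let you count substring matches without iterating over all of them.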
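The bottom-up counting step can be sketched on an uncompressed suffix trie, which is much easier to write than a real suffix tree. Note the assumptions: this naive trie takes O(n^2) time and space, so it only illustrates the counting, and the names `build_counted_suffix_trie` and `trie_count` as well as the `"\0"` terminator are choices made for this example.

```python
def build_counted_suffix_trie(text, term="\0"):
    """Build a naive suffix trie and the number of suffixes below each node."""
    s = text + term            # unique terminator: every suffix ends at a leaf
    children = [{}]            # children[node][char] -> child node index
    for start in range(len(s)):
        node = 0
        for ch in s[start:]:
            nxt = children[node].get(ch)
            if nxt is None:
                children.append({})
                nxt = len(children) - 1
                children[node][ch] = nxt
            node = nxt

    # Children always get larger indices than their parents, so walking the
    # indices backwards visits every child before its parent.
    counts = [0] * len(children)
    for node in range(len(children) - 1, -1, -1):
        if not children[node]:
            counts[node] = 1   # leaf: exactly one suffix ends here
        else:
            counts[node] = sum(counts[c] for c in children[node].values())
    return children, counts


def trie_count(children, counts, needle):
    """Number of occurrences of a non-empty needle (no terminator inside)."""
    node = 0
    for ch in needle:
        node = children[node].get(ch)
        if node is None:
            return 0
    return counts[node]


children, counts = build_counted_suffix_trie("banana")
for needle in ["a", "ana", "nan", "x"]:
    print(needle, trie_count(children, counts, needle))
```

In a real suffix tree the same sum is computed over compressed edges, which is what keeps the whole thing linear.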

Answer 4

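1) I would go with a finite automaton. I can't think of a specialized library right now, but general-purpose PCRE can be used to build an automaton that searches efficiently for the given substrings. For the substrings "foo" and "bar", you can construct the pattern /(foo)|(bar)/, scan the string, and get the "id" number of the matching substring by iterating over the ovector to check which group matched.
RE2::FindAndConsume is good if you only need the total count, not counts grouped by substring.
P.S. An example using Boost.Xpressive and loading the strings from a map: http://ericniebler.com/2010/09/27/boost-xpressive-ftw/
P.P.S. Recently I had fun creating a Ragel machine for a similar task. For a small set of search strings a "normal" DFA works fine; if you have a larger rule set, Ragel scanners show good results (see the related answer).
P.P.P.S. PCRE has the MARK keyword, which is very useful for this kind of subpattern classification (cf).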
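The "one alternation, check which group matched" idea can be sketched with Python's standard `re` module rather than PCRE. Two caveats: `re` is a backtracking engine, not a compiled DFA, and `finditer` reports non-overlapping matches only, so this does not give the overlapping counts the question asks for. The group names `g0`, `g1`, ... and the function name `count_by_group` are just labels invented for this example.

```python
import re
from collections import Counter


def count_by_group(haystack, needles):
    """Count non-overlapping matches per needle using one alternation."""
    # One named group per needle; re.escape keeps needles literal.
    pattern = re.compile("|".join(
        "(?P<g%d>%s)" % (i, re.escape(n)) for i, n in enumerate(needles)))
    counts = Counter()
    for m in pattern.finditer(haystack):
        # lastgroup is the name of the group that matched, e.g. "g1".
        counts[needles[int(m.lastgroup[1:])]] += 1
    return counts


print(count_by_group("foobarfoo", ["foo", "bar"]))
```

The group index plays the role of the substring "id" that the PCRE version reads from the ovector.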

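2) A long time ago I wrote a Trie-based thing in Scala for that kind of load: https://gist.github.com/ArtemGr/6150594; Trie.search goes through the string trying to match the current position against a number encoded in the Trie. The trie is encoded in a single cache-friendly array, and I expect it to be about as good as a non-JIT DFA.

3) I have been using boost::spirit for substring matching, but I never got around to measuring how it fares against other approaches. Spirit uses some efficient structure for symbols matching; perhaps that structure could be used standalone, without the overhead of Spirit.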

#include <cstdint>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
namespace qi = boost::spirit::qi; // alias needed for the qi:: names below
using qi::lit; using qi::alnum; using qi::digit; using qi::_val; using qi::_1; using boost::phoenix::ref;
static struct exact_t: qi::symbols<char, int> {
  exact_t() {add
    ("foo", 1)
    ("bar", 2);}
} exact;
int32_t match = -1;
qi::rule<const char*, int()> rule =
  +alnum >> exact [ref (match) = _1];
const char* it = haystack; // Mutable iterator for Spirit.
qi::parse (it, haystack + haystackLen, rule);

Answer 5

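If I understand correctly, your input string consists of many blocks of a single repeated character.

In that case, you can compress the text using run-length encoding.

For example: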

s = aaabbbbcc

is encoded as:

encoded_s = (a3)(b4)(c2)

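Now you can search for the pattern in the encoded text.

If you want a concrete algorithm, just search the web for pattern matching in run-length encoded strings. You can achieve a time complexity of O(N + M), where N and M are the lengths of the compressed text and the compressed pattern. Both M and N are generally much smaller than the original lengths, so this can outperform standard pattern-matching algorithms such as KMP.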
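As a rough sketch of the idea, the snippet below run-length encodes both strings and counts matches by comparing runs. It is a simple O(N * M) comparison over runs, not the O(N + M) algorithm mentioned above, and the names `rle_encode` and `count_rle_matches` are made up for this example.

```python
from itertools import groupby


def rle_encode(s):
    """Run-length encode s as a list of (char, run_length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


def count_rle_matches(text, pattern):
    """Count overlapping occurrences of pattern in text via their runs."""
    t, p = rle_encode(text), rle_encode(pattern)
    if not p:
        return 0
    if len(p) == 1:
        # A single-run pattern of length k fits n - k + 1 times in a run of n.
        ch, k = p[0]
        return sum(n - k + 1 for c, n in t if c == ch and n >= k)
    count = 0
    for i in range(len(t) - len(p) + 1):
        first, last = t[i], t[i + len(p) - 1]
        # Outer runs may be longer in the text; inner runs must match exactly.
        first_ok = first[0] == p[0][0] and first[1] >= p[0][1]
        last_ok = last[0] == p[-1][0] and last[1] >= p[-1][1]
        if first_ok and last_ok and t[i + 1:i + len(p) - 1] == p[1:-1]:
            count += 1
    return count


print(rle_encode("aaabbbbcc"))
print(count_rle_matches("aaabbbbcc", "bb"))
```

Each alignment of a multi-run pattern contributes exactly one match, because the run boundaries pin down where the pattern must sit.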
