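Imagine a very large array S of strings. From that array, I need to get only those strings that contain a particular substring. For example, if my array is String s[] = {"hello world", "back to hell", "say hello world"}; and my keyword is "hello", then it should return the first and last elements.

I tried using the KMP and Boyer-Moore algorithms to check whether each string in the array contains the substring, but that takes too much time.

Then I learned about the Aho-Corasick algorithm. I am still looking into it, but it seems to take an array of substrings and one large string to match against, which is the opposite of what I want.

So I am looking for suggestions on how to adapt Aho-Corasick to my purpose, or for another way to achieve it. Thanks for any suggestions.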

Use Ukkonen's algorithm, or the algorithm suggested in this source (PDF), to build a suffix tree:


McCreight's algorithm can be easily adapted to build the generalized suffix tree for a set S = {s1, s2, ..., sk} of strings of total length N in O(N) time ...



The pattern matching problem can be solved in optimal O(m + k) time, where k is the number of occurrences of P in T.


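Edit: If you are in a hurry and don't mind the extra time complexity, you can use this approach (PDF) to build a suffix array instead of a suffix tree, in O(n * log^2(n)) time and with a short piece of code. Here is the core idea of that approach: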


The algorithm is mainly based on maintaining the order of the string's suffixes sorted by their 2^k long prefixes.


n ← length(T)
for i ← 0 : n - 1
    P(0, i) ← position of T(i) in the ordered array of T's characters
cnt ← 1
for k ← 1 : ceil(log2 n)
    for i ← 0 : n - 1
        L(i) ← (P(k - 1, i), P(k - 1, i + cnt), i)
    sort L
    compute P(k, i), i = 0 : n - 1
    cnt ← 2 * cnt
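
The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the code from the linked PDF: the function name `build_suffix_array` is made up for this example, the list `rank` plays the role of P(k, i), and the loop stops as soon as all ranks are distinct instead of always running ceil(log2 n) rounds.

```python
def build_suffix_array(text):
    """Return the start positions of text's suffixes in lexicographic order."""
    n = len(text)
    if n == 0:
        return []
    # rank[i] plays the role of P(k, i): the rank of suffix i when only its
    # first 2^k characters are considered. Round 0 uses the character codes.
    rank = [ord(ch) for ch in text]
    order = list(range(n))
    cnt = 1
    while True:
        # Key for suffix i: (rank of its first half, rank of its second half).
        # -1 stands for "past the end" and sorts before every real rank.
        def key(i):
            return (rank[i], rank[i + cnt] if i + cnt < n else -1)

        order.sort(key=key)
        # Recompute the ranks; suffixes with equal keys share a rank.
        new_rank = [0] * n
        for prev, cur in zip(order, order[1:]):
            new_rank[cur] = new_rank[prev] + (key(cur) != key(prev))
        rank = new_rank
        if rank[order[-1]] == n - 1:  # all ranks distinct: fully sorted
            break
        cnt *= 2
    return order


print(build_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

Each round does one comparison sort, and there are at most about log2(n) rounds, which is where the O(n * log^2(n)) bound comes from.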



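Since the suffix array offers the order of T's suffixes, searching for a string P in T is easily done with a binary search. Since comparing is done in O(|P|) ...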
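
Applied to the original question (many strings, one keyword), the binary search could look like the sketch below. It is only an illustration under stated assumptions: `build_index` and `find_strings` are hypothetical names, the index stores (string index, offset) pairs for every suffix of every string, and it is built with a naive sort that compares whole suffixes, as a stand-in for a real suffix-array construction. The pattern is assumed to be non-empty.

```python
def build_index(strings):
    """Sort every suffix of every string; each entry is (string index, offset)."""
    entries = [(i, j) for i, s in enumerate(strings) for j in range(len(s))]
    # Naive stand-in for a real suffix-array construction: compares full suffixes.
    entries.sort(key=lambda e: strings[e[0]][e[1]:])
    return entries


def find_strings(strings, index, pattern):
    """Return the sorted indices of the strings that contain pattern."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th suffix in sorted order.
        i, j = index[k]
        return strings[i][j:j + m]

    # Binary search for the first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(index)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes that start with pattern are contiguous from lo onward.
    found = set()
    while lo < len(index) and prefix(lo) == pattern:
        found.add(index[lo][0])
        lo += 1
    return sorted(found)


strings = ["hello world", "back to hell", "say hello world"]
index = build_index(strings)
print(find_strings(strings, index, "hello"))  # [0, 2]
```

The binary search makes O(log n) comparisons of O(|P|) each, where n is the total number of suffixes; collecting the matches then costs time proportional to the number of occurrences.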


from collections import namedtuple

SuffixTree = namedtuple('SuffixTree', 'first_pos next_tree children')
SuffixForest = namedtuple('SuffixForest', 'first_string_pos first_tree next_forest children strings')

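# Note: this builds a naive suffix trie, not Ukkonen's linear-time suffix tree.
# It takes O(n^2) time and memory for a string of length n.
# Each node stores the end position of one occurrence of its path (first_pos);
# next_tree links to the node for the next occurrence of the same path.
def build_suffix_tree (string):
    root = SuffixTree(first_pos=-1, next_tree=None, children={})
    # Inserting the suffixes from the end keeps each occurrence chain in ascending order.
    for start in range(len(string) - 1, -1, -1):
        node = root
        for end in range(start, len(string)):
            char = string[end]
            previous = node.children.get(char)
            # The new head reuses the children dict of the previous head, so the
            # head always sees every continuation of this path.
            children = previous.children if previous is not None else {}
            head = SuffixTree(first_pos=end, next_tree=previous, children=children)
            node.children[char] = head
            node = head

    # And our final tree!
    return root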

# This returns an array of positions that match.
def match_suffix_tree (tree, string):
    # Navigate the tree to find the match.
    for c in string:
        if c not in tree.children:
            return []
        tree = tree.children[c]

    # Turn the match into an easily understood answer.
    answer = []
    while tree is not None:
        answer.append(tree.first_pos - len(string) + 1)
        tree = tree.next_tree
    return answer

tree = build_suffix_tree('foo')
print(match_suffix_tree(tree, 'oo'))





# This returns a suffix forest for the matches in common across many trees.
def build_suffix_forest (strings):
    children = {}
    forest = None
    # Building the forest from the end means that the strings are arranged in order.
    for i in range(len(strings) - 1, -1, -1):
        string = strings[i]
        tree = build_suffix_tree(string)

        # This caches results both per tree and per (forest, tree) pair.
        # We use id(...) in the keys because ids are fast to hash.
        cached = {}

        # Make a forest out of a tree.
        def make_forest (t):
            # Only do work if we have not been here.
            key = id(t)
            if key not in cached:
                new_children = {}
                for c in t.children:
                    new_children[c] = make_forest(t.children[c])
                cached[key] = SuffixForest(first_string_pos=i, first_tree=t, next_forest=None,
                                           children=new_children, strings=strings)
            return cached[key]

        # Recursively record the forest.  Caching matters because we would otherwise
        # visit the same node repeatedly.
        def add_tree_to_forest (f, t):
            # Add tree t to forest f
            # Only do work if we have not been here
            key = (id(f), id(t))
            if key not in cached:
                new_children = f.children.copy()
                for c in t.children:
                    if c in new_children:
                        # Recursively merge tree into forest.
                        new_children[c] = add_tree_to_forest(new_children[c], t.children[c])
                    else:
                        # The forest has no such branch yet, so start one from this subtree.
                        new_children[c] = make_forest(t.children[c])
                cached[key] = SuffixForest(first_string_pos=i, first_tree=t, next_forest=f,
                                           children=new_children, strings=strings)
            return cached[key]

        if forest is None:
            forest = make_forest(tree)
        else:
            forest = add_tree_to_forest(forest, tree)
    return forest

# This returns a dict mapping each matching string to its list of match positions.
def match_suffix_forest(forest, string):
    # Navigate the forest to find the match.
    for c in string:
        if c not in forest.children:
            return {}
        forest = forest.children[c]

    # Now build the match in a readable format.
    answer = {}
    while forest:
        matched_string = forest.strings[forest.first_string_pos]
        tree = forest.first_tree
        positions = []
        while tree:
            positions.append(tree.first_pos - len(string) + 1)
            tree = tree.next_tree
        answer[matched_string] = positions
        forest = forest.next_forest
    return answer

forest = build_suffix_forest(['foo', 'bbar', 'bazbar'])
print(match_suffix_forest(forest, 'ba'))
