Is there an efficient algorithm to extract only the strings that contain a specific substring from an array of strings?

Time: 2019-12-23 20:23:46

Tags: arrays string algorithm aho-corasick

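Imagine there is a large array S of strings. From that array, I only need the strings that contain a specific substring. For example, if my array is String s[] = {"hello world", "back to hell", "say hello world"}; and my keyword is "hello", it should return the first and last elements. I tried using the KMP and Boyer-Moore algorithms to check whether each string in the array contains the substring, but that took too much time. Then I learned about the Aho-Corasick algorithm. I am still looking into it, but it seems to need an array of substrings and one large string to match against, which is the opposite of what I want. So I am looking for suggestions on how to adapt Aho-Corasick to my purpose, or for another way to achieve it. Thanks for any suggestions.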
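For reference, the expected behaviour can be pinned down with a brute-force baseline. This is only a sketch in Python (the question itself uses Java-style syntax), and `filter_by_substring` is a hypothetical helper name. It rescans every string on every query, which is exactly the cost the question wants to avoid.

```python
def filter_by_substring(strings, keyword):
    """Return the strings that contain keyword, scanning every string once."""
    return [s for s in strings if keyword in s]


strings = ["hello world", "back to hell", "say hello world"]
print(filter_by_substring(strings, "hello"))
# ['hello world', 'say hello world']
```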

2 answers:

Answer 0 (score: 1)

Build a suffix tree using Ukkonen's algorithm, or the algorithm suggested in this source (PDF):

  

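McCreight's algorithm can easily be adapted to build the generalized suffix tree for a set S = {s1, s2, ..., sk} of strings of total length N in O(N) time...

Then use the resulting suffix tree to search for the given pattern. The problem is to find all occurrences of a pattern P (of length m) in the suffix tree T. According to the source above: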

  

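The pattern matching problem can be solved in optimal O(m + k) time, where k is the number of occurrences of P in T.

Note that the length of the text (or the number of strings in the array) does not affect search efficiency. So you can pay the cost of constructing the suffix tree once, then use it many times to search for short pattern strings efficiently.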

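Edit: If you are in a hurry and don't mind the extra time complexity, you can use this approach (PDF) to construct a suffix array instead of a suffix tree in O(n log^2 n) with a short piece of code. Here is the core idea of the approach: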

  

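The algorithm is mainly based on maintaining the order of the string's suffixes sorted by their 2^k long prefixes.

Here is the pseudocode copied from the source above: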

n ← length(T)
for i ← 0 : n - 1
    P(0, i) ← position of T(i) in the ordered array of T's characters
cnt ← 1
for k ← 1 : ceil(log2(n))
    for i ← 0 : n - 1
        L(i) ← (P(k - 1, i), P(k - 1, i + cnt), i)
    sort L
    compute P(k, i), i = 0 : n - 1
    cnt ← 2 * cnt

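After running this code, the last row of P gives the rank of every suffix in sorted order, which is equivalent to the suffix array. Searching with this approach is also simple: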
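The pseudocode can be turned into a short runnable sketch. The Python below is an illustration under a few assumptions rather than the source's own code: `build_suffix_array` is a made-up name, only the current row of P is kept (as `rank`), initial ranks are taken from `ord`, and the function returns the sorted suffix start positions instead of the rank table.

```python
def build_suffix_array(text):
    """Prefix doubling: sort suffixes by their first 2^k characters, k = 0, 1, ..."""
    n = len(text)
    if n == 0:
        return []
    # rank[i] plays the role of P(k, i); character codes preserve character order.
    rank = [ord(ch) for ch in text]
    order = list(range(n))
    cnt = 1
    while True:
        def key(i):
            # Rank of the first half, then of the second half (-1 if past the end).
            return (rank[i], rank[i + cnt] if i + cnt < n else -1)

        order.sort(key=key)
        # Recompute ranks: equal keys share a rank, a differing key starts a new one.
        new_rank = [0] * n
        for j in range(1, n):
            prev, cur = order[j - 1], order[j]
            new_rank[cur] = new_rank[prev] + (key(cur) != key(prev))
        rank = new_rank
        # Stop once every suffix has a distinct rank.
        if rank[order[-1]] == n - 1 or cnt >= n:
            break
        cnt *= 2
    return order


print(build_suffix_array("banana"))
# [5, 3, 1, 0, 4, 2]
```

Each round uses a comparison sort, so there are O(log n) rounds of O(n log n) work, matching the O(n log^2 n) bound mentioned above.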

  

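Since the suffix array gives the order of T's suffixes, searching for a string P in T is easily done with a binary search. Since each comparison is done in O(|P|), a search over n suffixes costs O(|P| log n).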
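Applied to the original question, the binary search might look like the sketch below. Everything named here is an assumption made for illustration: `build_index` and `strings_containing` are hypothetical helpers, the strings are joined with a `"\x00"` separator that is assumed never to occur in the data or the pattern, and the suffix array is built by naive sorting to keep the example short.

```python
from bisect import bisect_right

SEP = "\x00"  # assumption: never occurs in the input strings or in the pattern


def build_index(strings):
    """Join the strings and sort every suffix start position of the joined text."""
    text = SEP.join(strings)
    starts, offset = [], 0
    for s in strings:
        starts.append(offset)  # where each original string begins in text
        offset += len(s) + 1
    # Naive sort for brevity; a faster construction can be swapped in here.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return strings, text, starts, suffix_array


def strings_containing(index, pattern):
    """Return the original strings that contain pattern, in array order."""
    strings, text, starts, sa = index
    m = len(pattern)
    # Binary search for the first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # All matching suffixes are adjacent; map each one back to its source string.
    hits = set()
    while lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        hits.add(bisect_right(starts, sa[lo]) - 1)
        lo += 1
    return [strings[i] for i in sorted(hits)]


index = build_index(["hello world", "back to hell", "say hello world"])
print(strings_containing(index, "hello"))
# ['hello world', 'say hello world']
```

The index is built once; each query then costs a binary search plus a scan over the matching suffixes, independent of how many strings do not match.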

Answer 1 (score: 0)

First, you have to build a suffix tree. Ukkonen's algorithm builds a compressed one in linear time; to keep the code short, the version below builds an uncompressed suffix trie instead, which answers the same lookups.

from collections import namedtuple

SuffixTree = namedtuple('SuffixTree', 'first_pos next_tree children')

# This is a naive suffix trie, not Ukkonen's algorithm.
# It is O(n^2) memory and time for a string of length n.
def build_suffix_tree (string):
    root = SuffixTree(first_pos=-1, next_tree=None, children={})
    # Inserting the suffixes from the end means that the occurrences
    # linked through next_tree are arranged in order.
    for start in range(len(string) - 1, -1, -1):
        node = root
        for end in range(start, len(string)):
            char = string[end]
            old = node.children.get(char)
            # The new node records the occurrence ending at `end`, links to the
            # later occurrences, and shares their lookup table.
            new = SuffixTree(first_pos=end, next_tree=old,
                             children={} if old is None else old.children)
            node.children[char] = new
            node = new

    # And our final tree!
    return root

# This returns an array of positions that match.
def match_suffix_tree (tree, string):
    # Navigate the tree to find the match.
    for c in string:
        if c not in tree.children:
            return []
        tree = tree.children[c]

    # Turn the match into an easily understood answer.
    answer = []
    while tree is not None:
        answer.append(tree.first_pos - len(string) + 1)
        tree = tree.next_tree
    return answer

tree = build_suffix_tree('foo')
print(match_suffix_tree(tree, 'oo'))

Note that the occurrences of each substring are kept in a linked list, and every node in such a list shares one lookup table. Following one character at a time can only ever reach substrings that really occur in the string. The price is size: there is one node for every (start, end) pair, so for long strings a compressed suffix tree built with Ukkonen's algorithm is the better choice.

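However, this is not the answer you want. You want an answer across many strings. So we will build a forest out of the trees. A forest is a very similar data structure: each forest is a linked list of the trees you might be in, and each forest has a lookup table.

This can be a fairly large data structure, but it is not as large as it looks when printed, because the same objects are referenced over and over. For example, only one list of strings is actually kept, even though there are many ways to reach it.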

SuffixForest = namedtuple('SuffixForest', 'first_string_pos first_tree next_forest children strings')

# This returns a suffix forest for the matches in common across many trees.
def build_suffix_forest (strings):
    children = {}
    forest = None
    # Building the forest from the end means that the strings are arranged in order.
    for i in range(len(strings) - 1, -1, -1):
        string = strings[i]
        tree = build_suffix_tree(string)

        # This caches results both per tree and per (forest, tree) pair.
        # We actually use id(...) in our keys because they are fast to hash.
        cached = {}

        # Make a forest out of a tree.
        def make_forest (t):
            # Only do work if we have not been here.
            key = id(t)
            if key not in cached:
                new_children = {}
                for c in t.children:
                    new_children[c] = make_forest(t.children[c])
                cached[key] = SuffixForest(first_string_pos=i, first_tree=t, next_forest=None,
                                           children=new_children, strings=strings)
            return cached[key]


        # Recursively record the forest.  Caching matters because we would otherwise
        # visit the same node repeatedly.
        def add_tree_to_forest (f, t):
            # Add tree t to forest f
            # Only do work if we have not been here
            key = (id(f), id(t))
            if key not in cached:
                new_children = f.children.copy()
                for c in t.children:
                    if c in new_children:
                        # Recursively merge tree into forest.
                        new_children[c] = add_tree_to_forest(new_children[c], t.children[c])
                    else:
                        new_children[c] = make_forest(t.children[c])
                cached[key] = SuffixForest(first_string_pos=i, first_tree=t, next_forest=f,
                                           children=new_children, strings=strings)
            return cached[key]

        if forest is None:
            forest = make_forest(tree)
        else:
            forest = add_tree_to_forest(forest, tree)
    return forest


def match_suffix_forest(forest, string):
    # Navigate the forest to find the match.
    for c in string:
        if c not in forest.children:
            return {}
        forest = forest.children[c]

    # Now build the match in a readable format.
    answer = {}
    while forest:
        matched_string = forest.strings[forest.first_string_pos]
        tree = forest.first_tree
        positions = []
        while tree:
            positions.append(tree.first_pos - len(string) + 1)
            tree = tree.next_tree
        answer[matched_string] = positions
        forest = forest.next_forest
    return answer

forest = build_suffix_forest(['foo', 'bbar', 'bazbar'])
print(match_suffix_forest(forest, 'ba'))

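Note that in both versions, most of the work of a lookup goes into formatting a nice answer. The lookup itself for a string of length m is O(m), no matter how many strings are in the collection or how many matches there are.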