Imagine I have a very large array of strings S. From that array, I only need the strings that contain a particular substring. For example, if my array is
String s [] = {"hello world", "back to hell", "say hello world"};
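and my keyword is "hello", then it should return the first and last elements.
I tried using the KMP and Boyer-Moore algorithms to check whether each string in the array contains the substring, but that took too much time.
Then I came across the Aho-Corasick algorithm. I am still reading up on it, but it seems to take an array of substrings and one large string to match against, whereas I want the opposite.
So I am looking for suggestions on how to adapt Aho-Corasick to my purpose, or for another way to achieve it. Thanks for any suggestions.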
Answer 0 (score: 1)
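Build a suffix tree using Ukkonen's algorithm, or the algorithm suggested in this source (PDF):
"McCreight's algorithm can be easily adapted to build the generalized suffix tree for a set S = {s1, s2, ..., sk} of strings of total length N in O(N) time ..."
Then use the resulting suffix tree to search for the given pattern. The problem is to find all occurrences of a pattern P (of length m) in the suffix tree T. According to the source above:
"The pattern matching problem can be solved in optimal O(m + k) time, where k is the number of occurrences of P in T."
Note that neither the length of the text nor the number of strings in the array affects the search time. So you can pay the cost of building the suffix tree once, and then reuse it many times to search for short pattern strings efficiently.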
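To make the "build once, query many times" idea concrete, here is a minimal sketch. It is not McCreight's or Ukkonen's algorithm: it is a naive generalized suffix trie, so construction is quadratic in the length of each string rather than O(N), and the names `build_index` and `lookup` are made up for this illustration. A lookup still costs time proportional to the pattern length plus the size of the answer.

```python
def build_index(strings):
    """Insert every suffix of every string into one shared trie (naive sketch)."""
    root = {"ids": set(), "next": {}}
    for sid, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["next"].setdefault(ch, {"ids": set(), "next": {}})
                # Remember which strings pass through this node.
                node["ids"].add(sid)
    return root


def lookup(root, strings, pattern):
    """Return the strings containing pattern, in their original order."""
    if not pattern:
        return list(strings)
    node = root
    for ch in pattern:
        node = node["next"].get(ch)
        if node is None:
            return []
    return [strings[i] for i in sorted(node["ids"])]


strings = ["hello world", "back to hell", "say hello world"]
index = build_index(strings)
print(lookup(index, strings, "hello"))
```

The trie trades memory for query speed, so for long strings a real suffix tree or a suffix array is likely the better choice.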
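EDIT: If you are in a hurry and don't mind the extra time complexity, you can use this approach (PDF) to build a suffix array instead of a suffix tree, in O(n * log^2(n)) time and with a small amount of code. This is the core idea of the approach:
"The algorithm is mainly based on maintaining the order of the string's suffixes sorted by their 2^k long prefixes."
Here is the pseudocode, copied from the source above: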
n ← length(T)
for i ← 0 : n - 1
    P(0, i) ← position of T(i) in the ordered array of T's characters
cnt ← 1
for k ← 1 : ceil(log2(n))
    for i ← 0 : n - 1
        L(i) ← (P(k - 1, i), P(k - 1, i + cnt), i)
    sort L
    compute P(k, i), i = 0 : n - 1
    cnt ← 2 * cnt
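After this code runs, the last row of P holds the rank of each suffix in sorted order, from which the suffix array follows directly. Searching with this approach is also simple:
"Since the suffix array offers the order of T's suffixes, searching a string P into T is easily done with a binary search. Since comparing is done in O(|P|) ..."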
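The pseudocode and the binary search can be sketched in Python as follows. This is an illustration under stated assumptions, not the source's code: the function names are hypothetical, only the latest row of P is kept (as `rank`), and the array of strings is joined with a separator character (`"\x00"` here) that is assumed not to occur in any string or pattern. Each lookup does a binary search with O(|P|) comparisons, so it costs about O(|P| * log n) for a text of length n.

```python
from bisect import bisect_right


def build_suffix_array(text):
    """Prefix doubling: sort suffixes by their first 2, 4, 8, ... characters."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(ch) for ch in text]  # plays the role of P(0, i)
    sa = list(range(n))
    cnt = 1
    while True:
        def key(i):
            # Pair of ranks: the first cnt characters and the next cnt characters.
            return (rank[i], rank[i + cnt] if i + cnt < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # every suffix has a distinct rank
            return sa
        cnt *= 2


def find_occurrences(text, sa, pattern):
    """Binary search for the block of suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def strings_containing(strings, pattern, sep="\x00"):
    """Map matches in the joined text back to the strings they fall in."""
    text = sep.join(strings)
    starts, pos = [], 0
    for s in strings:
        starts.append(pos)
        pos += len(s) + len(sep)
    sa = build_suffix_array(text)  # in real use, build this once and reuse it
    hits = {bisect_right(starts, p) - 1 for p in find_occurrences(text, sa, pattern)}
    return [strings[i] for i in sorted(hits)]


print(strings_containing(["hello world", "back to hell", "say hello world"], "hello"))
```

For repeated queries you would keep `text`, `starts` and `sa` around instead of rebuilding them on every call; they are rebuilt here only to keep the sketch short.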
Answer 1 (score: 0)
First, you need a suffix tree for each string. Ukkonen's algorithm is the standard linear-time way to build one.
from collections import namedtuple
SuffixTree = namedtuple('SuffixTree', 'first_pos next_tree children')
SuffixForest = namedtuple('SuffixForest', 'first_string_pos first_tree next_forest children strings')
# This builds a plain suffix trie rather than running Ukkonen's algorithm.
# It is simpler, but O(n^2) time and memory in the worst case for a string of
# length n, and the recursion depth grows with the length of the string.
def build_suffix_tree(string):
    # Link the end positions of one substring's occurrences into a list.
    def chain(ends, children):
        node = None
        for end in reversed(ends):
            node = SuffixTree(first_pos=end, next_tree=node, children=children)
        return node
    # Group the occurrences by the character that extends them.
    def extend(ends):
        groups = {}
        for end in ends:
            if end + 1 < len(string):
                groups.setdefault(string[end + 1], []).append(end + 1)
        return {c: chain(g, extend(g)) for c, g in groups.items()}
    # And our final tree! The empty string "ends" just before every position.
    return SuffixTree(first_pos=-1, next_tree=None,
                      children=extend(range(-1, len(string) - 1)))
# This returns an array of positions that match.
def match_suffix_tree(tree, string):
    # Navigate the tree to find the match.
    for c in string:
        if c not in tree.children:
            return []
        tree = tree.children[c]
    # Turn the match into an easily understood answer.
    answer = []
    while tree is not None:
        # first_pos is where the match ends; report where it starts.
        answer.append(tree.first_pos - len(string) + 1)
        tree = tree.next_tree
    return answer

tree = build_suffix_tree('foo')
print(match_suffix_tree(tree, 'oo'))  # prints [1]
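Note that everything is built out of linked lists, and the same object can be referenced from many places. Printed out, it looks like a very large data structure, but much of it is shared: each node is just a position, a link to the next match, and a lookup table.
However, this is not yet the answer you want. You want an answer across many strings. So we will build a forest out of the trees. A forest is a very similar data structure: each forest node is a linked list of the trees you might be in, and each forest node has a lookup table.
This can be a fairly large data structure, but again not as large as it looks, because the same things are referenced over and over. For example, although there are many ways to reach the strings, only one list of strings is actually kept.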
SuffixForest = namedtuple('SuffixForest', 'first_string_pos first_tree next_forest children strings')

# This returns a suffix forest for the matches in common across many trees.
def build_suffix_forest(strings):
    forest = None
    # Building the forest from the end means that the strings are arranged in order.
    for i in range(len(strings) - 1, -1, -1):
        string = strings[i]
        tree = build_suffix_tree(string)
        # This will cache both from tree and (forest, tree) pair.
        # We actually use id(...) in our keys because they are fast to hash.
        cached = {}

        # Make a forest out of a tree.
        def make_forest(t):
            # Only do work if we have not been here.
            key = id(t)
            if key not in cached:
                new_children = {}
                for c in t.children:
                    new_children[c] = make_forest(t.children[c])
                cached[key] = SuffixForest(first_string_pos=i, first_tree=t, next_forest=None,
                                           children=new_children, strings=strings)
            return cached[key]

        # Recursively record the forest. Caching avoids repeating work for a
        # node that can be reached along more than one path.
        def add_tree_to_forest(f, t):
            # Add tree t to forest f.
            # Only do work if we have not been here.
            key = (id(f), id(t))
            if key not in cached:
                new_children = f.children.copy()
                for c in t.children:
                    if c in new_children:
                        # Recursively merge tree into forest.
                        new_children[c] = add_tree_to_forest(new_children[c], t.children[c])
                    else:
                        new_children[c] = make_forest(t.children[c])
                cached[key] = SuffixForest(first_string_pos=i, first_tree=t, next_forest=f,
                                           children=new_children, strings=strings)
            return cached[key]

        if forest is None:
            forest = make_forest(tree)
        else:
            forest = add_tree_to_forest(forest, tree)
    return forest
def match_suffix_forest(forest, string):
    # Navigate the forest to find the match.
    for c in string:
        if c not in forest.children:
            return {}
        forest = forest.children[c]
    # Now build the match in a readable format.
    answer = {}
    while forest is not None:
        matched_string = forest.strings[forest.first_string_pos]
        tree = forest.first_tree
        positions = []
        while tree is not None:
            positions.append(tree.first_pos - len(string) + 1)
            tree = tree.next_tree
        answer[matched_string] = positions
        forest = forest.next_forest
    return answer

forest = build_suffix_forest(['foo', 'bbar', 'bazbar'])
print(match_suffix_forest(forest, 'ba'))  # prints {'bbar': [1], 'bazbar': [0, 3]}
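Note that in both versions, most of the work in a lookup goes into formatting a nice answer. The lookup itself, for a string of length m, is O(m), no matter how many strings are in the collection or how many matches there are.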