Question

I am trying to find the words that contain a given substring (the input) in a huge text. The text looks like this: * america * python * erica * escape * .. Example: input: "rica" => output: america, erica

I am using a suffix array.

My pseudocode (Python-like) is:

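firstChar = input[0]                 # the first character of input
suffixArray = getSuffixArray(text)   # suffix array of text
result = []
length = len(input)

for index in range(len(suffixArray)):
    indexText = suffixArray[index]   # start position of this suffix in text
    if text[indexText] != firstChar:
        continue                     # only look at suffixes starting with firstChar

    if text[indexText:indexText + length] == input:
        # find the whole word containing this index, between '*' delimiters
        start = text.rfind('*', 0, indexText) + 1
        end = text.find('*', indexText)
        if end == -1:
            end = len(text)
        result.append(text[start:end].strip())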

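This works, but it is too slow. An LCP array is supposed to improve the running time of the algorithm, but I can't figure out how. Can you give me a suggestion?

Thanks in advance!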

Answer 1

Free Python code for building a suffix array is available at "Effcient way to find longest duplicate string". It can handle up to 100 million characters on a personal computer.
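For illustration, here is a minimal sketch of how a suffix array can replace the linear scan from the question. Since the suffixes are sorted, every suffix that starts with the input forms one contiguous range, which two binary searches can locate in O(m log n) string comparisons (m is the input length, n the text length). The names `build_suffix_array`, `find_occurrences` and `words_containing` are hypothetical, and the naive construction is only a stand-in for a real suffix array builder such as the linked code. It also assumes the input is non-empty and does not contain the '*' separator.

```python
def build_suffix_array(text):
    # Naive construction, for illustration only: sorts all suffixes directly.
    # Replace with a proper suffix array builder for large texts.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[first:lo] starts with pattern.
    return sa[first:lo]


def words_containing(text, pattern, sep='*'):
    """Return the sorted, de-duplicated words (delimited by sep) containing pattern."""
    sa = build_suffix_array(text)
    words = set()
    for pos in find_occurrences(text, sa, pattern):
        start = text.rfind(sep, 0, pos) + 1   # 0 if there is no separator before pos
        end = text.find(sep, pos)
        if end == -1:
            end = len(text)
        words.add(text[start:end].strip())
    return sorted(words)


print(words_containing("* america * python * erica * escape *", "rica"))
# ['america', 'erica']
```

In real use the suffix array would be built once and reused for every query, rather than rebuilt inside `words_containing`. The LCP array is not needed for this version. Its usual role is to avoid re-comparing characters during the binary search, which brings a lookup down to O(m + log n) as in the Manber-Myers method.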

Fast way of finding substrings in text using suffix array and LCP
