Results of a string search library: bug, feature, or my coding error?

Time: 2011-11-11 21:33:19

Tags: python string algorithm search text

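I am using this Python library, which implements the Aho-Corasick string search algorithm. The algorithm finds a set of patterns in a given string in a single pass. The output is not what I expected: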

In [4]: import ahocorasick
In [5]: import collections

In [6]: tree = ahocorasick.KeywordTree()

In [7]: ss = "this is the first sentence in this book the first sentence is really the most interesting the first sentence is always first"

In [8]: words = ["first sentence is", "first sentence", "the first sentence", "the first sentence is"]

In [9]: for w in words:
   ...:     tree.add(w)

In [10]: tree.make()

In [13]: final = collections.defaultdict(int)

In [15]: for match in tree.findall(ss, allow_overlaps=True):
   ....:     final[ss[match[0]:match[1]]] += 1

In [16]: final
{   'the first sentence': 3, 'the first sentence is': 2}


I expected the output to be:

{'the first sentence': 3,
 'the first sentence is': 2,
 'first sentence': 3,
 'first sentence is': 2}


2 answers:

Answer 0 (score: 1)



Answer 1 (score: 1)

I don't know the ahocorasick module, but those results look suspicious. The acora module gives this:

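import acora
import collections

# Parentheses let the adjacent string literals span several lines.
ss = ("this is the first sentence in this book "
      "the first sentence is really the most interesting "
      "the first sentence is always first")

words = ["first sentence is",
         "first sentence",
         "the first sentence",
         "the first sentence is"]

tree = acora.AcoraBuilder(*words).build()

result = collections.defaultdict(int)

# findall() yields (keyword, position) pairs, including overlapping matches.
for keyword, _position in tree.findall(ss):
    result[keyword] += 1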


>>> result
defaultdict(<type 'int'>, 
            {'the first sentence'   : 3,
             'first sentence'       : 3,
             'first sentence is'    : 2,
             'the first sentence is': 2})
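
As an independent sanity check on the counts above, here is a minimal brute-force sketch that uses only the standard library. It is not Aho-Corasick: it scans the text once per pattern with `str.find`, stepping forward one character at a time so that overlapping occurrences are counted. The helper name `count_overlapping` is made up for this illustration and does not belong to either library.

```python
import collections


def count_overlapping(text, patterns):
    """Count every occurrence of each pattern in text, overlaps included."""
    counts = collections.defaultdict(int)
    for pattern in patterns:
        start = text.find(pattern)
        while start != -1:
            counts[pattern] += 1
            # Advance by a single character so overlapping matches are found too.
            start = text.find(pattern, start + 1)
    return dict(counts)


ss = ("this is the first sentence in this book "
      "the first sentence is really the most interesting "
      "the first sentence is always first")

words = ["first sentence is",
         "first sentence",
         "the first sentence",
         "the first sentence is"]

result = count_overlapping(ss, words)
print(result)
```

For this input the brute-force scan agrees with the acora output, so all four patterns should be reported when overlapping matches are requested.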