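Python - Searching for occurrences in text

Time: 2012-10-07 01:05:27

Tags: python algorithm search

I have a list of phrases. I need to check whether parts of these phrases appear in a large block of text.

e.g.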

  • Marshmallows are delicious and warm
  • Giant unicorns sign wonderful melodies of the imminent apocalypse
  • The wizards assaulted the fort, but forgot their spell books at home!

The block of text is:

Marshmallows are delicious. I've been snacking on them while the wizards assaulted the fort. The unicorns sign wonderful melodies of those who forgot their spell books at home. [...]


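Additional notes:

I can't rely on stop words such as "and" or "or", or on punctuation, to split the phrases.


Any ideas about libraries and/or strategies?

Thanks :)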

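2 answers:

Answer 0: (score: 1)

You can generate the "parts" of each phrase (runs of consecutive words) in descending order of length, then look for those parts in the block of text. The first part found is the longest one that occurs.

e.g.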

>>> text = "Marshmallows are delicious. I've been snacking on them while the wizards assaulted the fort. The unicorns sign wonderful melodies of those who forgot their spell books at home."
>>> phrase='Giant unicorns sign wonderful melodies of the imminent apocalypse'
>>> words = phrase.split()
>>> parts = list()
>>> for length in range(len(words),3,-1): # The stop value 3 is excluded, so a part has at least 4 words (use 2 to allow 3-word parts)
    for start in range(0,len(words)-length + 1):
        parts.append(' '.join(words[start:start+length]))
>>> # A step of -1 means the list is built in decreasing order of length.
>>> parts
['Giant unicorns sign wonderful melodies of the imminent apocalypse', 'Giant unicorns sign wonderful melodies of the imminent', 'unicorns sign wonderful melodies of the imminent apocalypse', 'Giant unicorns sign wonderful melodies of the', 'unicorns sign wonderful melodies of the imminent', 'sign wonderful melodies of the imminent apocalypse', 'Giant unicorns sign wonderful melodies of', 'unicorns sign wonderful melodies of the', 'sign wonderful melodies of the imminent', 'wonderful melodies of the imminent apocalypse', 'Giant unicorns sign wonderful melodies', 'unicorns sign wonderful melodies of', 'sign wonderful melodies of the', 'wonderful melodies of the imminent', 'melodies of the imminent apocalypse', 'Giant unicorns sign wonderful', 'unicorns sign wonderful melodies', 'sign wonderful melodies of', 'wonderful melodies of the', 'melodies of the imminent', 'of the imminent apocalypse']
>>> found = None # stays None if no part occurs in the text
>>> for part in parts:
    if part.lower() in text.lower(): # lowercase both sides for case insensitivity
        found = part
        break

>>> found
'unicorns sign wonderful melodies of'
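
The same idea can be wrapped in a function and run over every phrase in the question. This is only a sketch under a few assumptions that are not in the answer above: the names `tokenize`, `longest_shared_part`, `min_words` and `TEXT` are made up for illustration, and both phrase and text are reduced to lowercase word tokens first, so punctuation is ignored and a part can only match on whole words.

```python
import re

TEXT = ("Marshmallows are delicious. I've been snacking on them while the "
        "wizards assaulted the fort. The unicorns sign wonderful melodies of "
        "those who forgot their spell books at home.")

def tokenize(s):
    """Lowercase s and return its word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", s.lower())

def longest_shared_part(phrase, text, min_words=3):
    """Return the longest run of consecutive phrase words that also occurs
    as consecutive words in text, or None if no run has min_words words."""
    words = tokenize(phrase)
    # Pad with spaces so that a match always starts and ends on a word boundary.
    haystack = " " + " ".join(tokenize(text)) + " "
    # Try the longest parts first, exactly as in the transcript above.
    for length in range(len(words), min_words - 1, -1):
        for start in range(len(words) - length + 1):
            part = " ".join(words[start:start + length])
            if " " + part + " " in haystack:
                return part
    return None

phrases = [
    "Marshmallows are delicious and warm",
    "Giant unicorns sign wonderful melodies of the imminent apocalypse",
    "The wizards assaulted the fort, but forgot their spell books at home!",
]
for phrase in phrases:
    print(repr(longest_shared_part(phrase, TEXT)))
```

Because punctuation is dropped before matching, a part may span a sentence boundary in the text. If that is not wanted, split the text into sentences first and search each one separately.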

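Answer 1: (score: 0)

Take a look at Xapian for storing searchable information and retrieving it (results = results!), and at the Levenshtein distance algorithm, for which several modules exist.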
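
For reference, the Levenshtein distance counts the minimum number of single-element insertions, deletions and substitutions needed to turn one sequence into another. Below is a minimal, dependency-free sketch of the standard dynamic-programming version. The function name `levenshtein` is chosen here for illustration and is not the API of any particular module, and this is not a description of how Xapian works.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b, where each insertion,
    deletion or substitution costs 1. Works on strings or lists of words."""
    # previous[j] holds the distance between the first i-1 items of a
    # and the first j items of b.
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # turning i items of a into an empty sequence
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))
print(levenshtein("giant unicorns sign".split(), "the unicorns sign".split()))
```

Passing lists of words instead of strings measures the distance in whole words, which may suit phrase matching better than character-level distance. A small distance between a phrase part and a window of the text could then be treated as an approximate match, with the threshold chosen to taste.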