Efficiently searching for keywords when the keywords are multi-word

Asked: 2018-01-15 07:38:53

Tags: python string pattern-matching string-matching keyword-search

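I need to efficiently match a very large list of keywords (> 1000000) against a string in Python. I found some very good libraries that try to do this quickly:

1) FlashText (https://github.com/vi3k6i5/flashtext)

2) The Aho-Corasick algorithm, etc.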

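However, I have one special requirement: in my context, the keyword 'XXXX YYYY' should return a match if my string is 'XXXX is a very good indicator of YYYY'. Note that 'XXXX YYYY' does not occur as a substring, but both XXXX and YYYY are present in the string, and that is good enough for me.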
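To pin down the matching rule described above, here is a minimal sketch of the naive check. The function name `keyword_matches` is hypothetical, and whitespace tokenization with case-sensitive comparison is an assumption for illustration, not something the question specifies.

```python
def keyword_matches(keyword, text):
    """Return True if every word of `keyword` occurs as a token in `text`."""
    # Assumption: tokens are separated by whitespace and compared case-sensitively.
    text_tokens = set(text.split())
    return all(word in text_tokens for word in keyword.split())


text = "XXXX is a very good indicator of YYYY"
print(keyword_matches("XXXX YYYY", text))  # True: both words are present
print(keyword_matches("XXXX ZZZZ", text))  # False: ZZZZ is missing
```

Running this check once per keyword costs time proportional to the size of the keyword list for every string, which is the inefficiency the question is trying to avoid.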

I know how to do this naively. What I am looking for is efficiency: are there any good libraries for this?

2 Answers:

Answer 0: (score: 1)

This falls into the "naive" camp, but here is one approach that uses sets, as food for thought:

docs = [
    """ Here's a sentence with dog and apple in it """,
    """ Here's a sentence with dog and poodle in it """,
    """ Here's a sentence with poodle and apple in it """,
    """ Here's a dog with and apple and a poodle in it """,
    """ Here's an apple with a dog to show that order is irrelevant """
]

query = ['dog', 'apple']

def get_similar(query, docs):
    """Return the docs that contain every word in query."""
    query_set = set(query)
    res = []
    for doc in docs:
        # keep doc only if all words of the query appear among its tokens
        if query_set <= set(doc.split()):
            res.append(doc)
    return res

print(get_similar(query, docs))

Returns:

[" Here's a sentence with dog and apple in it ", 
" Here's a dog with and apple and a poodle in it ", 
" Here's an apple with a dog to show that order is irrelevant "]

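Of course, the time complexity isn't all that great, but thanks to the speed of hashing/set operations it is much faster than using lists throughout.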
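For the direction the question asks about (many keywords checked against one string), one way to avoid testing every keyword is an inverted index from each word to the keywords that contain it. The following is only a sketch under assumptions: the names `build_index` and `find_keywords` are hypothetical, tokenization is plain whitespace splitting, and it is not taken from FlashText or any other library.

```python
from collections import defaultdict


def build_index(keywords):
    """Map each word to the ids of the keywords that contain it."""
    word_to_ids = defaultdict(list)
    sizes = []  # number of distinct words in each keyword
    for kw_id, keyword in enumerate(keywords):
        words = set(keyword.split())
        sizes.append(len(words))
        for word in words:
            word_to_ids[word].append(kw_id)
    return word_to_ids, sizes


def find_keywords(text, keywords, word_to_ids, sizes):
    """Return the keywords whose words all occur somewhere in `text`."""
    hits = defaultdict(int)  # keyword id -> number of its words seen in text
    for token in set(text.split()):
        for kw_id in word_to_ids.get(token, ()):
            hits[kw_id] += 1
    # A keyword matches when every one of its distinct words was seen.
    return [keywords[i] for i, n in sorted(hits.items()) if n == sizes[i]]


keywords = ["dog apple", "poodle", "dog cat", "apple"]
word_to_ids, sizes = build_index(keywords)
print(find_keywords("Here's an apple with a dog", keywords, word_to_ids, sizes))
# ['dog apple', 'apple']
```

The index is built once; each lookup then only touches keywords that share at least one word with the string, rather than the whole keyword list.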

Part 2: Elasticsearch is a good candidate if you are willing to put in the effort and you are dealing with a lot of data.

Answer 1: (score: 1)

Your problem sounds like a full text search task. There is a Python search package called whoosh. @rerek's corpus can be indexed and searched in memory as follows.

from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser
from whoosh import fields


texts = [
    "Here's a sentence with dog and apple in it",
    "Here's a sentence with dog and poodle in it",
    "Here's a sentence with poodle and apple in it",
    "Here's a dog with and apple and a poodle in it",
    "Here's an apple with a dog to show that order is irrelevant"
]

schema = fields.Schema(text=fields.TEXT(stored=True))
storage = RamStorage()
index = storage.create_index(schema)  # in-memory index, nothing is written to disk

writer = index.writer()
for t in texts:
    writer.add_document(text=t)
writer.commit()

# the default query parser requires all terms, so this means "dog AND apple"
query = QueryParser('text', schema).parse('dog apple')
with index.searcher() as searcher:
    for r in searcher.search(query):
        print(r)

This produces:

<Hit {'text': "Here's a sentence with dog and apple in it"}>
<Hit {'text': "Here's a dog with and apple and a poodle in it"}>
<Hit {'text': "Here's an apple with a dog to show that order is irrelevant"}>

You can also persist the index by using FileStorage, as described in How to index documents.