Question

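I need to efficiently match a very large list of keywords (>1,000,000) in a string using Python. I found some really good libraries that try to do this fast:

1) FlashText (https://github.com/vi3k6i5/flashtext)

2) The Aho-Corasick algorithm, etc.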

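However, I have a peculiar requirement: in my context, a keyword such as 'XXXX YYYY' should return a match if my string is 'XXXX is a very good indication of YYYY'. Note that 'XXXX YYYY' does not occur as a substring, but XXXX and YYYY are both present in the string, and that is good enough for me.

I know how to do this naively. What I am looking for is efficiency. Are there any better libraries for this?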

Answer 1

This falls into the "naive" camp, but here is an approach that uses sets, as food for thought:

docs = [
    """ Here's a sentence with dog and apple in it """,
    """ Here's a sentence with dog and poodle in it """,
    """ Here's a sentence with poodle and apple in it """,
    """ Here's a dog with and apple and a poodle in it """,
    """ Here's an apple with a dog to show that order is irrelevant """
]

query = ['dog', 'apple']

def get_similar(query, docs):
    res = []
    query_set = set(query)
    for i in docs:
        # keep i only if every word of the query occurs in it (order is irrelevant)
        if query_set <= set(i.split()):
            res.append(i)
    return res

This returns:

[" Here's a sentence with dog and apple in it ", 
" Here's a dog with and apple and a poodle in it ", 
" Here's an apple with a dog to show that order is irrelevant "]

Of course, the time complexity is not great, but because hash/set operations are fast, it is much faster than working with plain lists.

Part 2 is that Elasticsearch is a great candidate for this, if you are willing to put in the effort and you are dealing with a lot of data.
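Short of a full search engine, the set idea can also be turned around to fit the question's case (many keywords, one string): index the keywords instead of the documents. The following is only a minimal sketch under stated assumptions, not a benchmarked solution. It assumes whitespace tokenization and exact, case-sensitive word matches, and the names `build_keyword_index` and `find_keywords` are made up for illustration. Each keyword is filed under one of its words, so a lookup only examines keywords that share at least one word with the text.

```python
from collections import defaultdict

def build_keyword_index(keywords):
    """File each multi-word keyword under one of its words (its 'anchor')."""
    index = defaultdict(list)
    for kw in keywords:
        words = frozenset(kw.split())
        if not words:
            continue  # skip empty keywords
        # Assumption: any word works as the anchor; min() just keeps it deterministic.
        anchor = min(words)
        index[anchor].append((words, kw))
    return index

def find_keywords(text, index):
    """Return every keyword whose words all occur in text, in any order."""
    tokens = set(text.split())
    matches = []
    for tok in tokens:
        # Only keywords anchored on a word that is present in the text are checked.
        for words, kw in index.get(tok, ()):
            if words <= tokens:
                matches.append(kw)
    return matches

keywords = ["XXXX YYYY", "dog apple", "poodle"]
index = build_keyword_index(keywords)
print(find_keywords("XXXX is a very good indication of YYYY", index))
# prints ['XXXX YYYY']
```

Choosing the rarest word of each keyword as the anchor, rather than an arbitrary one, would likely shrink the candidate lists further, but that needs word frequency statistics the question does not provide.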

Answer 2

Your problem sounds like a full-text search task. There is a Python search package called whoosh. @rerek's corpus can be indexed and searched in memory as follows.

from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser
from whoosh import fields


texts = [
    "Here's a sentence with dog and apple in it",
    "Here's a sentence with dog and poodle in it",
    "Here's a sentence with poodle and apple in it",
    "Here's a dog with and apple and a poodle in it",
    "Here's an apple with a dog to show that order is irrelevant"
]

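# a single stored TEXT field, so hits can show the original sentence
schema = fields.Schema(text=fields.TEXT(stored=True))
# keep the index in memory; create_index() already returns the open index
storage = RamStorage()
index = storage.create_index(schema)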

writer = index.writer()
for t in texts:
    writer.add_document(text=t)
writer.commit()

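# QueryParser joins terms with AND by default, so both words must be present
query = QueryParser('text', schema).parse('dog apple')

# the context manager closes the searcher once the results are printed
with index.searcher() as searcher:
    for r in searcher.search(query):
        print(r)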

This produces:

<Hit {'text': "Here's a sentence with dog and apple in it"}>
<Hit {'text': "Here's a dog with and apple and a poodle in it"}>
<Hit {'text': "Here's an apple with a dog to show that order is irrelevant"}>

You can also use FileStorage, as described in How to index documents, to persist the index.

Efficiently searching for keywords when the keywords are multi-word
