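I need to efficiently match a very large list of keywords (> 1000000) against a string using Python. I found some really good libraries that try to do this fast:
1) FlashText (https://github.com/vi3k6i5/flashtext)
2) The Aho-Corasick algorithm, etc.
However, I have a special requirement: in my context, if my string is 'XXXX is a very good indication of YYYY', the keyword 'XXXX YYYY' should return a match. Note that 'XXXX YYYY' does not occur as a substring, but XXXX and YYYY are both present in the string, and that is good enough for me.
I know how to do this naively. What I am looking for is efficiency. Are there any good libraries for this?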
Answer 0 (score: 1)
This falls into the "naive" camp, but here is an approach that uses sets, as food for thought:
docs = [
    """ Here's a sentence with dog and apple in it """,
    """ Here's a sentence with dog and poodle in it """,
    """ Here's a sentence with poodle and apple in it """,
    """ Here's a dog with and apple and a poodle in it """,
    """ Here's an apple with a dog to show that order is irrelevant """
]

query = ['dog', 'apple']

def get_similar(query, docs):
    res = []
    query_set = set(query)
    for i in docs:
        # if all n elements of query are in i, keep i
        if query_set & set(i.split(" ")) == query_set:
            res.append(i)
    return res

print(get_similar(query, docs))
This returns:
[" Here's a sentence with dog and apple in it ", " Here's a dog with and apple and a poodle in it ", " Here's an apple with a dog to show that order is irrelevant "]
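Of course, the time complexity is not great, but because hashing/set operations are fast, it is much quicker than doing the same checks with plain lists.
Part 2 is that Elasticsearch is a good candidate, if you are willing to put in the effort and you are dealing with a lot of data.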
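For the scale in the question (over a million keywords checked against one string), the same set idea can be turned around so that only keywords sharing a token with the string are examined. The following is a minimal sketch, not a benchmarked solution. It assumes whitespace tokenization and case-sensitive matching, and the names `build_index` and `match_keywords` are hypothetical helpers introduced here for illustration.

```python
from collections import defaultdict


def build_index(keywords):
    """Group keywords by one of their tokens so lookups touch few candidates."""
    index = defaultdict(list)
    for kw in keywords:
        tokens = frozenset(kw.split())
        if not tokens:
            continue  # skip empty keywords
        # Assumption: file each keyword under its lexicographically smallest
        # token. Any fixed token works, since a match needs all of them.
        index[min(tokens)].append((kw, tokens))
    return index


def match_keywords(text, index):
    """Return keywords whose tokens all occur in text, in any order."""
    words = set(text.split())
    hits = []
    for word in words:
        # Only keywords filed under a word that is present can match.
        for kw, tokens in index.get(word, ()):
            if tokens <= words:
                hits.append(kw)
    return hits


keywords = ["XXXX YYYY", "dog apple", "poodle", "apple cat"]
index = build_index(keywords)
print(match_keywords("XXXX is a very good indication of YYYY", index))
```

The index is built once. Each lookup then costs roughly the number of distinct words in the string plus the size of the buckets those words select, rather than a pass over every keyword.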
Answer 1 (score: 1)
Your problem sounds like a full text search task. There is a Python search package called whoosh. @rerek's corpus can be indexed and searched in memory as shown below.
from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser
from whoosh import fields
texts = [
"Here's a sentence with dog and apple in it",
"Here's a sentence with dog and poodle in it",
"Here's a sentence with poodle and apple in it",
"Here's a dog with and apple and a poodle in it",
"Here's an apple with a dog to show that order is irrelevant"
]
schema = fields.Schema(text=fields.TEXT(stored=True))
storage = RamStorage()
index = storage.create_index(schema)
storage.open_index()
writer = index.writer()
for t in texts:
    writer.add_document(text=t)
writer.commit()
query = QueryParser('text', schema).parse('dog apple')
results = index.searcher().search(query)
for r in results:
    print(r)
This produces:
<Hit {'text': "Here's a sentence with dog and apple in it"}>
<Hit {'text': "Here's a dog with and apple and a poodle in it"}>
<Hit {'text': "Here's an apple with a dog to show that order is irrelevant"}>
You can also persist the index by using FileStorage, as described in How to index documents.