I want to retrieve similar documents using the MinHash token filter. I created an index as shown below; it uses the whitespace tokenizer and then hashes each token:
{
  "settings": {
    "analysis": {
      "filter": {
        "my_minhash_filter": {
          "type": "min_hash",
          "hash_count": 1,
          "bucket_count": 128,
          "hash_set_size": 1,
          "with_rotation": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "my_minhash_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "my_analyzer"
        }
      }
    }
  }
}
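To make explicit what I expect this filter to give me, here is a minimal pure-Python sketch of bucketed MinHash with one hash function and 128 buckets. It is only an illustration under my own assumptions: `token_hash`, `minhash_signature`, and `estimated_similarity` are hypothetical helper names, SHA-1 stands in for whatever hash Lucene really uses, and the rotation step is my reading of `with_rotation`, not the actual Elasticsearch implementation.

```python
import hashlib


def token_hash(token):
    """Stable 64-bit hash of a token (illustrative; not the hash Lucene uses)."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text, bucket_count=128):
    """Split on whitespace, then keep the smallest token hash per bucket."""
    buckets = [None] * bucket_count
    for token in set(text.split()):
        h = token_hash(token)
        b = h % bucket_count
        if buckets[b] is None or h < buckets[b]:
            buckets[b] = h
    if all(value is None for value in buckets):
        return buckets  # empty input: nothing to rotate
    # Rotation (assumed behaviour): fill each empty bucket with the value
    # of the next non-empty bucket, wrapping around the end of the list.
    filled = list(buckets)
    for i in range(bucket_count):
        j = i
        while filled[i] is None:
            j = (j + 1) % bucket_count
            filled[i] = buckets[j]
    return filled


def estimated_similarity(sig_a, sig_b):
    """Fraction of bucket positions on which two signatures agree."""
    pairs = [(a, b) for a, b in zip(sig_a, sig_b) if a is not None and b is not None]
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a == b) / len(pairs)


def exact_jaccard(text_a, text_b):
    """Exact Jaccard similarity of the whitespace-token sets, for comparison."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```

With this sketch, a document compared with itself agrees on every bucket, while two documents with no tokens in common should agree on none, which is the kind of separation I was hoping to see in the scores.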
Then I added the 20 Newsgroups dataset to Elasticsearch:
from elasticsearch import Elasticsearch
from sklearn.datasets import fetch_20newsgroups

# Load the training split of the 20 Newsgroups dataset
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Index each post in the "title" field, using its position as the document id
for index, example in enumerate(twenty_train['data']):
    es.create(index='test_analyzer', doc_type='_doc', id=index, body={"title": example})
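The MinHash token filter seems to be applied without problems, but the results are strange. For example, when I query for an exact item, I get the following: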
{
  "query": {
    "match": {
      "title": "From: steve@titan.tsd.arlut.utexas.edu (Steve Glicker)\nSubject: 2 1000W Power Supplies\nNntp-Posting-Host: rooster\nOrganization: Applied Research Labs, The University of Texas at Austin\nDistribution: misc\nLines: 14\n\nTwo LH Research SM11-1 power supplies (SM10 series).\n\n1000W, 5V, 200A (currently wired for 115VAC)\n\nControl lines: +/- sense, on/off, pwr.fail, high/low margin, and\ncurrent monitor.\n\n(The list price from LH Research is $824.00 each for qty. 1-9)\n\nAsking $500.00 for the pair.\n\nSteve Glicker\nAustin, Texas\n(steve@titan.tsd.arlut.utexas.edu)\n"
    }
  }
}
Response:
{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 10999,
    "max_score": 0.6097297,
    "hits": [
      {
        "_index": "test_analyzer",
        "_type": "_doc",
        "_id": "8161",
        "_score": 0.6097297,
        "_source": {
          "title": "From: steve@titan.tsd.arlut.utexas.edu (Steve Glicker)\nSubject: 2 1000W Power Supplies\nNntp-Posting-Host: rooster\nOrganization: Applied Research Labs, The University of Texas at Austin\nDistribution: misc\nLines: 14\n\nTwo LH Research SM11-1 power supplies (SM10 series).\n\n1000W, 5V, 200A (currently wired for 115VAC)\n\nControl lines: +/- sense, on/off, pwr.fail, high/low margin, and\ncurrent monitor.\n\n(The list price from LH Research is $824.00 each for qty. 1-9)\n\nAsking $500.00 for the pair.\n\nSteve Glicker\nAustin, Texas\n(steve@titan.tsd.arlut.utexas.edu)\n"
        }
      },
      {
        "_index": "test_analyzer",
        "_type": "_doc",
        "_id": "8901",
        "_score": 0.60938174,
        "_source": {
          "title": "Organization: City University of New York\nFrom: <F36SI@CUNYVM.BITNET>\nSubject: Model United Nations\nLines: 3\n\n Just observed at the National Model United Nations here in NYC.\n Just one word on it : AWSOME.\n Peace, matt\n"
        }
      },
....
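The first result is what I expected: it is the actual post I searched for. However, its score is barely higher than that of the next hit, which is a completely different text. How can I query this index so that I only get texts whose approximate kappa score is above a certain threshold?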
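One idea I have not been able to verify against my index is to require a minimum share of the query's MinHash tokens to match, using the `minimum_should_match` option of the `match` query. The sketch below only builds the request body; `minhash_threshold_query` is a hypothetical helper name, and treating the matched-token percentage as a stand-in for the similarity threshold is my assumption, not documented behaviour.

```python
def minhash_threshold_query(text, field="title", threshold=0.8):
    """Build a match query that requires a share of the analyzed tokens to match.

    Assumption: the field's search analyzer emits the MinHash tokens, so the
    percentage of matching tokens roughly tracks the similarity I care about.
    """
    percent = int(round(threshold * 100))
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "minimum_should_match": "{}%".format(percent),
                }
            }
        }
    }


# Intended usage with the client from above (not run here):
# es.search(index='test_analyzer', body=minhash_threshold_query(some_post, threshold=0.8))
```

If this is the wrong approach, I would be glad to know what the intended way is.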
I tried googling, but unfortunately there is very little information on using MinHash to find similar items.
Thanks in advance!