Retrieving an approximate Jaccard score using Elasticsearch and the MinHash token filter

Time: 2019-03-13 14:12:26

Tags: elasticsearch minhash

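I want to retrieve similar documents using the MinHash token filter. I created an index as shown below; it uses the whitespace tokenizer and then applies the min_hash filter to the resulting tokens: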

{
  "settings": {
    "analysis": {
      "filter": {
        "my_minhash_filter": {
          "type": "min_hash",
          "hash_count": 1,   
          "bucket_count": 128, 
          "hash_set_size": 1, 
          "with_rotation": true 
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "my_minhash_filter"
          ]
        }
      }
    }
  },
  "mappings": {
      "_doc": {
        "properties": {
          "title": {
            "type": "text",
            "analyzer": "my_analyzer",
            "search_analyzer": "my_analyzer"
          }
        }
      }
  }
}
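For context, the quantity a MinHash signature is meant to approximate is the Jaccard similarity of two token sets. The following is a minimal sketch of that idea in plain Python. It is not Elasticsearch's implementation: the function names are hypothetical, and it uses one seeded hash per signature position rather than the single hash with 128 buckets configured above.

```python
import hashlib


def minhash_signature(tokens, num_hashes=128):
    """Return one minimum hash value per seeded hash function.

    Assumes `tokens` is non-empty. Illustrative only; not the scheme
    used by Elasticsearch's min_hash filter.
    """
    unique = set(tokens)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt gives a different hash function
        smallest = min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in unique
        )
        signature.append(smallest)
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions where both minima agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(tokens_a, tokens_b):
    """Exact Jaccard similarity of the two token sets, for comparison."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc_a = "two lh research power supplies for sale".split()
    doc_b = "two lh research power supplies wanted".split()
    print("exact:", exact_jaccard(doc_a, doc_b))
    print("estimated:", estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```

The estimate gets closer to the exact value as `num_hashes` grows. With whitespace tokens, as in the analyzer above, the sets being compared are sets of individual words.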

I then added the 20 Newsgroups dataset to Elasticsearch:

from elasticsearch import Elasticsearch
from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

for index, example in enumerate(twenty_train['data']):
    es.create(index='test_analyzer', doc_type='_doc', id=index, body={"title": example})

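Indexing with the MinHash token filter seems to work fine, but the results are strange. For example, when I query for an exact document, I get the following: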

{
    "query": {
        "match": {
            "title": "From: steve@titan.tsd.arlut.utexas.edu (Steve Glicker)\nSubject: 2 1000W Power Supplies\nNntp-Posting-Host: rooster\nOrganization: Applied Research Labs, The University of Texas at Austin\nDistribution: misc\nLines: 14\n\nTwo LH Research SM11-1 power supplies (SM10 series).\n\n1000W, 5V, 200A (currently wired for 115VAC)\n\nControl lines: +/- sense, on/off, pwr.fail, high/low margin, and\ncurrent monitor.\n\n(The list price from LH Research is $824.00 each for qty. 1-9)\n\nAsking $500.00 for the pair.\n\nSteve Glicker\nAustin, Texas\n(steve@titan.tsd.arlut.utexas.edu)\n"
        }
    }
}

Answer:
{
    "took": 8,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 10999,
        "max_score": 0.6097297,
        "hits": [
            {
                "_index": "test_analyzer",
                "_type": "_doc",
                "_id": "8161",
                "_score": 0.6097297,
                "_source": {
                    "title": "From: steve@titan.tsd.arlut.utexas.edu (Steve Glicker)\nSubject: 2 1000W Power Supplies\nNntp-Posting-Host: rooster\nOrganization: Applied Research Labs, The University of Texas at Austin\nDistribution: misc\nLines: 14\n\nTwo LH Research SM11-1 power supplies (SM10 series).\n\n1000W, 5V, 200A (currently wired for 115VAC)\n\nControl lines: +/- sense, on/off, pwr.fail, high/low margin, and\ncurrent monitor.\n\n(The list price from LH Research is $824.00 each for qty. 1-9)\n\nAsking $500.00 for the pair.\n\nSteve Glicker\nAustin, Texas\n(steve@titan.tsd.arlut.utexas.edu)\n"
                }
            },
            {
                "_index": "test_analyzer",
                "_type": "_doc",
                "_id": "8901",
                "_score": 0.60938174,
                "_source": {
                    "title": "Organization: City University of New York\nFrom: <F36SI@CUNYVM.BITNET>\nSubject: Model United Nations\nLines: 3\n\n    Just observed at the National Model United Nations here in NYC.\n    Just one word on it : AWSOME.\n                                 Peace, matt\n"
                }
            },
....

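The first result is as expected: it is the actual post I searched for. However, its score is barely higher than that of the next hit, which is a completely different text. How can I query this index so that I only get texts whose approximate Jaccard score is above a certain threshold?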

I tried googling, but unfortunately there is very little information on using MinHash to find similar items.

Thanks in advance!
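One possible direction for the threshold question, offered as an untested sketch rather than a confirmed solution: the `match` query accepts a `minimum_should_match` parameter, so requiring a percentage of the query's MinHash tokens to match should act roughly like a lower bound on the estimated similarity. The helper name `build_minhash_query` and its parameters are hypothetical, and the `_score` returned by Elasticsearch would still be a relevance score, not a Jaccard value.

```python
def build_minhash_query(text, min_match_percent=80, field="title"):
    """Build a match query body that requires a share of the query tokens to match.

    Assumption: `field` is analyzed with the min_hash analyzer at search time,
    so the share of matching tokens roughly tracks the estimated similarity.
    """
    if not 0 <= min_match_percent <= 100:
        raise ValueError("min_match_percent must be between 0 and 100")
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "minimum_should_match": f"{min_match_percent}%",
                }
            }
        }
    }


# Possible usage with the client from the question (needs a running cluster):
# results = es.search(index='test_analyzer', body=build_minhash_query(example, 80))
```

Whether the percentage maps cleanly onto a similarity threshold depends on how many tokens the filter emits for the query text, so it would need to be checked against the index above.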

0 answers:

No answers yet