Elasticsearch simple terms query gives strange scores

Time: 2016-12-08 10:19:05

Tags: elasticsearch

Elasticsearch version: 5.0.2

I populate my index with:

{_id: 1, tags: ['plop', 'plip', 'plup']},
{_id: 2, tags: ['plop', 'plup']},
{_id: 3, tags: ['plop']},
{_id: 4, tags: ['plap', 'plep']},
{_id: 5, tags: ['plop', 'plip', 'plup']},
{_id: 6, tags: ['plup', 'plip']},
{_id: 7, tags: ['plop', 'plip']}

Then I want to retrieve the most relevant rows for the tags plop and plip:

query: {
  bool: {
    should: [
      {term: {tags: {value:'plop', _name: 'plop'}}},
      {term: {tags: {value:'plip', _name: 'plip'}}}
    ]
  }
}

This is equivalent to the following (but I use the former for debugging):

query: {
  bool: {
    should: [
      {terms: {tags: ['plop', 'plip']}}
    ]
  }
}

I then get very strange scores:

[
  { id: '2', score: 0.88002616, tags: [ 'plop', 'plup' ] },
  { id: '6', score: 0.88002616, tags: [ 'plup', 'plip' ] },
  { id: '5', score: 0.5063205, tags: [ 'plop', 'plip', 'plup' ] },
  { id: '7', score: 0.3610978, tags: [ 'plop', 'plip' ] },
  { id: '1', score: 0.29277915, tags: [ 'plop', 'plip', 'plup' ] },
  { id: '3', score: 0.2876821, tags: [ 'plop' ] }
]

Here is the full response:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 0.88002616,
    "hits": [
      {
        "_index": "myindex",
        "_type": "mytype",
        "_id": "2",
        "_score": 0.88002616,
        "_source": {
          "tags": [
            "plop",
            "plup"
          ]
        },
        "matched_queries": [
          "plop"
        ]
      },
      {
        "_index": "myindex",
        "_type": "mytype",
        "_id": "6",
        "_score": 0.88002616,
        "_source": {
          "tags": [
            "plup",
            "plip"
          ]
        },
        "matched_queries": [
          "plip"
        ]
      },
      {
        "_index": "myindex",
        "_type": "mytype",
        "_id": "5",
        "_score": 0.5063205,
        "_source": {
          "tags": [
            "plop",
            "plip",
            "plup"
          ]
        },
        "matched_queries": [
          "plop",
          "plip"
        ]
      },
      {
        "_index": "myindex",
        "_type": "mytype",
        "_id": "7",
        "_score": 0.3610978,
        "_source": {
          "tags": [
            "plop",
            "plip"
          ]
        },
        "matched_queries": [
          "plop",
          "plip"
        ]
      },
      {
        "_index": "myindex",
        "_type": "mytype",
        "_id": "1",
        "_score": 0.29277915,
        "_source": {
          "tags": [
            "plop",
            "plip",
            "plup"
          ]
        },
        "matched_queries": [
          "plop",
          "plip"
        ]
      },
      {
        "_index": "myindex",
        "_type": "mytype",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "tags": [
            "plop"
          ]
        },
        "matched_queries": [
          "plop"
        ]
      }
    ]
  }
}

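So, I have three questions:

  1. Why do rows that match only one of the queries (ids 2 and 6) score higher than rows that match both (ids 1, 5 and 7)?
  2. Why can two rows with identical tags get different scores (ids 1 and 5)?
  3. Am I missing something?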

2 answers:

Answer 0 (score: 1)

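OK, I found your real problem. By default, Elasticsearch stores an index's data in 5 shards, and when the number of documents is small, that can matter a lot when _score is computed. Some theory about shards: https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html

Why does it matter? For better performance, each shard computes _score from its own data only. To compute the score, Elasticsearch uses a TF/IDF-style algorithm (BM25 by default in 5.x) that depends on the total number of documents and on the frequency of the search term IN THE SHARD: https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html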

To solve this, you can create the index with a single shard, like this:

{
"settings": {
        "number_of_shards" :   1,
        "number_of_replicas" : 0
    },
  "mappings": {
    "my_type": {
      "properties": {
        "tags": {
          "type":  "keyword"
        }
      }
    }
  }
}

You can verify my theory by adding ?explain to the search query:

http://localhost:9200/test1/my_type/_search?explain

If you want more detail, you can read through this example ;) These are my results for the query ["plop", "plip"]:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 0.9808292,
    "hits": [
      {
        "_index": "test",
        "_type": "my_type",
        "_id": "2",
        "_score": 0.9808292,
        "_source": {
          "tags": [
            "plop",
            "plup"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "my_type",
        "_id": "6",
        "_score": 0.9808292,
        "_source": {
          "tags": [
            "plup",
            "plip"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "my_type",
        "_id": "5",
        "_score": 0.5753642,
        "_source": {
          "tags": [
            "plop",
            "plip",
            "plup"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.36464313,
        "_source": {
          "tags": [
            "plop",
            "plip",
            "plup"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "my_type",
        "_id": "7",
        "_score": 0.36464313,
        "_source": {
          "tags": [
            "plop",
            "plip"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "tags": [
            "plop"
          ]
        }
      }
    ]
  }
}

Why does the document with plop, plip, plup come third? Check its explain output:

   "_shard": "[test][1]",
        "_node": "LjGrgIa7QgiPlEvMxqKOdA",
        "_index": "test",
        "_type": "my_type",
        "_id": "5",
        "_score": 0.5753642,
        "_source": {
          "tags": [
            "plop",
            "plip",
            "plup"
          ]
        },

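This is the only doc in this shard, test[1] (I verified this against the other returned documents)! So docFreq and docCount are both 1 for each term, and the IDF is computed from this shard alone rather than from the whole index. Each matching term contributes the same weight, and the two weights add up to the document's score. Check how the 0.5753642 score is computed for this doc: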

 "value": 0.2876821,
                  "description": "weight(tags:plop...

                      "details": [
                        {
                          "value": 0.2876821,
                          "description": "idf(docFreq=1, docCount=1)",

plus
  {
                  "value": 0.2876821,
                  "description": "weight(tags:plip..

                          "value": 0.2876821,
                          "description": "idf(docFreq=1, docCount=1)",
                          "details": []
                        },
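
The numbers in the explain output can be reproduced by hand. The sketch below assumes the BM25 similarity (the default in Elasticsearch 5.x) and a tf normalisation of exactly 1, which is what a keyword field without norms would give. bm25Idf is an illustrative helper name, and the shard layouts for docs 2, 1 and 7 are assumptions inferred from the scores above, not read from the cluster.

```javascript
// BM25 inverse document frequency, computed from one shard's statistics:
// idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
function bm25Idf(docFreq, docCount) {
  return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
}

// Doc 5 is alone on its shard: docFreq=1, docCount=1 (taken from the explain output).
const idfAlone = bm25Idf(1, 1);
const scoreDoc5 = 2 * idfAlone; // one contribution per matching term

// Assumption: doc 2 shares a shard with two documents that do not contain "plop".
const scoreDoc2 = bm25Idf(1, 3);

// Assumption: docs 1 and 7 share a shard, and both contain both terms.
const scoreDoc7 = 2 * bm25Idf(2, 2);

console.log(idfAlone.toFixed(6), scoreDoc5.toFixed(6));
console.log(scoreDoc2.toFixed(6), scoreDoc7.toFixed(6));
```

Under these assumptions the values agree with the reported 0.2876821, 0.5753642, 0.9808292 and 0.36464313 to about six decimal places. The term that looks rarest within its own shard gets the largest weight, which is why a document matching a single tag can outrank one matching both.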

Answer 1 (score: 1)

The problem I ran into is well explained in jgr's answer.

The solution I found is to use dfs_query_then_fetch as the search type.

Here is the resulting query with the JavaScript client:

{
  body: {
    query: {
      bool: {
        should: [
          {terms: {tags: ['plop', 'plip']}}
        ]
      }
    }
  },
  // search_type is a request parameter, so it sits next to body, not inside it
  searchType: 'dfs_query_then_fetch'
}
  

Note that this is almost certainly unnecessary once the index holds more data, because the scores even out naturally across shards.
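
To see why index-wide statistics fix the ordering, the sketch below scores the seven documents from the question with document frequencies counted over the whole index, which is roughly what the extra DFS round trip makes available to every shard. It assumes BM25 with a tf normalisation of 1 (keyword field, no norms). bm25Idf and rank are hypothetical helper names, and the result is a by-hand approximation rather than output from Elasticsearch.

```javascript
const docs = [
  {id: 1, tags: ['plop', 'plip', 'plup']},
  {id: 2, tags: ['plop', 'plup']},
  {id: 3, tags: ['plop']},
  {id: 4, tags: ['plap', 'plep']},
  {id: 5, tags: ['plop', 'plip', 'plup']},
  {id: 6, tags: ['plup', 'plip']},
  {id: 7, tags: ['plop', 'plip']}
];

// BM25 inverse document frequency.
function bm25Idf(docFreq, docCount) {
  return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
}

// Score every document against the query tags using index-wide statistics.
function rank(allDocs, queryTags) {
  const idf = {};
  for (const tag of queryTags) {
    const docFreq = allDocs.filter(d => d.tags.includes(tag)).length;
    idf[tag] = bm25Idf(docFreq, allDocs.length);
  }
  return allDocs
    .map(d => ({
      id: d.id,
      // Sum the weight of each query tag the document contains (tf normalisation assumed to be 1).
      score: queryTags.reduce((sum, t) => sum + (d.tags.includes(t) ? idf[t] : 0), 0)
    }))
    .filter(hit => hit.score > 0)
    // Highest score first; ties are broken by id to keep the output stable.
    .sort((a, b) => b.score - a.score || a.id - b.id);
}

const hits = rank(docs, ['plop', 'plip']);
console.log(hits.map(h => `${h.id}: ${h.score.toFixed(4)}`).join('\n'));
```

With these statistics the three documents that match both tags (ids 1, 5 and 7) come first with identical scores, followed by the document that matches only the rarer tag plip.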