Question

Elasticsearch version: 5.0.2

I populate my index with:

{_id: 1, tags: ['plop', 'plip', 'plup']},
{_id: 2, tags: ['plop', 'plup']},
{_id: 3, tags: ['plop']},
{_id: 4, tags: ['plap', 'plep']},
{_id: 5, tags: ['plop', 'plip', 'plup']},
{_id: 6, tags: ['plup', 'plip']},
{_id: 7, tags: ['plop', 'plip']}

Then I want to retrieve the rows most relevant to the tags plop and plip:

query: {
  bool: {
    should: [
      {term: {tags: {value:'plop', _name: 'plop'}}},
      {term: {tags: {value:'plip', _name: 'plip'}}}
    ]
  }
}

This is equivalent to the following (but I use the previous form for debugging):

query: {
  bool: {
    should: [
      {terms: {tags: ['plop', 'plip']}}
    ]
  }
}

Then I get some strange scores:

[
  { id: '2', score: 0.88002616, tags: [ 'plop', 'plup' ] },
  { id: '6', score: 0.88002616, tags: [ 'plup', 'plip' ] },
  { id: '5', score: 0.5063205, tags: [ 'plop', 'plip', 'plup' ] },
  { id: '7', score: 0.3610978, tags: [ 'plop', 'plip' ] },
  { id: '1', score: 0.29277915, tags: [ 'plop', 'plip', 'plup' ] },
  { id: '3', score: 0.2876821, tags: [ 'plop' ] }
]

Here is the full response:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 0.88002616,
    "hits": [
      {
        "_index": "myindex",
        "_type": "mytype",
        "_id": "2",
        "_score": 0.88002616,
        "_source": {
          "tags": [
            "plop",
            "plup"
          ]
        },
        "matched_queries": [
          "plop"
        ]
      },
      {
        "_index": "myindex",
        "_type": "mytype",
        "_id": "6",
        "_score": 0.88002616,
        "_source": {
          "tags": [
            "plup",
            "plip"
          ]
        },
        "matched_queries": [
          "plip"
        ]
      },
      {
        "_index": "myindex",
        "_type": "mytype",
        "_id": "5",
        "_score": 0.5063205,
        "_source": {
          "tags": [
            "plop",
            "plip",
            "plup"
          ]
        },
        "matched_queries": [
          "plop",
          "plip"
        ]
      },
      {
        "_index": "myindex",
        "_type": "mytype",
        "_id": "7",
        "_score": 0.3610978,
        "_source": {
          "tags": [
            "plop",
            "plip"
          ]
        },
        "matched_queries": [
          "plop",
          "plip"
        ]
      },
      {
        "_index": "myindex",
        "_type": "mytype",
        "_id": "1",
        "_score": 0.29277915,
        "_source": {
          "tags": [
            "plop",
            "plip",
            "plup"
          ]
        },
        "matched_queries": [
          "plop",
          "plip"
        ]
      },
      {
        "_index": "myindex",
        "_type": "mytype",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "tags": [
            "plop"
          ]
        },
        "matched_queries": [
          "plop"
        ]
      }
    ]
  }
}

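So, I have two questions:

Why do rows that match only one query (ids 2 and 6) score higher than rows that match both (ids 1, 5 and 7)?
Why can two rows with identical tags have different scores (ids 1 and 5)?

Am I missing something?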

Answer 1

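OK, I found your real problem. By default, Elasticsearch uses 5 shards to store your index data, and when you have only a few documents this can matter a lot for the computed _score values. Some theory about shards: https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html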

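Why does it matter? Because, for better performance, each shard computes _score from its own data only. To compute the score, Elasticsearch uses a TF/IDF-style algorithm (BM25 by default in 5.x) that depends on the total number of documents and on the frequency of the search term IN THE SHARD (https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html).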
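As a worked illustration of that dependence, here is the idf term as Lucene's BM25 similarity computes it, assuming BM25 is the similarity in use. The symbols are mine: N is the number of documents in the shard (docCount in explain output) and n_t is the number of those documents containing term t (docFreq). The same term gets a different idf depending on which statistics are used.

```latex
\mathrm{idf}(t) = \ln\!\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)

% whole index from the question, term plop: N = 7, n_t = 5
\mathrm{idf}(\text{plop}) = \ln\!\left(1 + \frac{2.5}{5.5}\right) \approx 0.375

% a shard holding a single document, which contains plop: N = 1, n_t = 1
\mathrm{idf}(\text{plop}) = \ln\!\left(1 + \frac{0.5}{1.5}\right) \approx 0.288
```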

To work around this, you can create the index with a single shard, like this:

{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "my_type": {
      "properties": {
        "tags": {
          "type": "keyword"
        }
      }
    }
  }
}

You can verify my theory by adding ?explain to your search query:

http://localhost:9200/test1/my_type/_search?explain

If you want more detail, read through this example ;) These are my results for the query ["plop", "plip"]:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 0.9808292,
    "hits": [
      {
        "_index": "test",
        "_type": "my_type",
        "_id": "2",
        "_score": 0.9808292,
        "_source": {
          "tags": [
            "plop",
            "plup"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "my_type",
        "_id": "6",
        "_score": 0.9808292,
        "_source": {
          "tags": [
            "plup",
            "plip"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "my_type",
        "_id": "5",
        "_score": 0.5753642,
        "_source": {
          "tags": [
            "plop",
            "plip",
            "plup"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.36464313,
        "_source": {
          "tags": [
            "plop",
            "plip",
            "plup"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "my_type",
        "_id": "7",
        "_score": 0.36464313,
        "_source": {
          "tags": [
            "plop",
            "plip"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "tags": [
            "plop"
          ]
        }
      }
    ]
  }
}

Why does the document with plop, plip, plup come third? Check its explain output:

   "_shard": "[test][1]",
        "_node": "LjGrgIa7QgiPlEvMxqKOdA",
        "_index": "test",
        "_type": "my_type",
        "_id": "5",
        "_score": 0.5753642,
        "_source": {
          "tags": [
            "plop",
            "plip",
            "plup"
          ]
        },

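This is the only doc in this shard, test[1] (I verified this against the other returned documents)! So the IDF is computed with docFreq=1 and docCount=1, that is, from this shard's statistics alone rather than from the whole index. See how the score of 0.5753642 is computed for this doc: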

 "value": 0.2876821,
                  "description": "weight(tags:plop...

                      "details": [
                        {
                          "value": 0.2876821,
                          "description": "idf(docFreq=1, docCount=1)",

summed with

  {
                  "value": 0.2876821,
                  "description": "weight(tags:plip..

                          "value": 0.2876821,
                          "description": "idf(docFreq=1, docCount=1)",
                          "details": []
                        },
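
The explain values above can be reproduced with a short simulation. This is a minimal sketch, not Elasticsearch's actual code: it assumes the BM25 idf formula, assumes each matching tag contributes exactly its idf (one occurrence per document, no length normalization on keyword fields), and assumes a shard layout that the answer does not state. The names `bm25Idf`, `shards` and `shardScore` are hypothetical. Under these assumptions the computed values line up with the scores in the result list above.

```javascript
// BM25 idf as computed by Lucene's BM25Similarity (assumed to be the similarity in use).
function bm25Idf(docFreq, docCount) {
  return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
}

// The seven documents from the question.
const docs = {
  1: ['plop', 'plip', 'plup'],
  2: ['plop', 'plup'],
  3: ['plop'],
  4: ['plap', 'plep'],
  5: ['plop', 'plip', 'plup'],
  6: ['plup', 'plip'],
  7: ['plop', 'plip'],
};

// Assumed shard layout (not given in the answer); the fifth shard is empty.
const shards = [[5], [1, 7], [3], [2, 4, 6], []];

// Score of one document using only the statistics of its own shard.
// Assumption: the term-frequency factor is 1, so each match adds its idf.
function shardScore(id, shardIds, terms) {
  let score = 0;
  for (const term of terms) {
    if (!docs[id].includes(term)) continue;
    const docFreq = shardIds.filter((other) => docs[other].includes(term)).length;
    score += bm25Idf(docFreq, shardIds.length);
  }
  return score;
}

const scores = {};
for (const shardIds of shards) {
  for (const id of shardIds) {
    scores[id] = shardScore(id, shardIds, ['plop', 'plip']);
  }
}

// Doc 4 matches neither term, so its score is 0 and it would not be a hit.
console.log(scores);
```

Doc 5 sits alone on its shard, so both of its terms get idf(docFreq=1, docCount=1) and the two weights add up to its score. Docs 2 and 6 each match only one term, but that term is rare within their shard, which is why they end up on top.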

Answer 2

The problem I ran into is well explained in jgr's answer.

The solution I found is to use dfs_query_then_fetch as the search type.

Here is the resulting query using the JavaScript client:

searchType: 'dfs_query_then_fetch', // request parameter, a sibling of body
body: {
  query: {
    bool: {
      should: [
        {terms: {tags: ['plop', 'plip']}}
      ]
    }
  }
}

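Note that this is almost certainly unnecessary once the index holds more data, because the scores then balance out naturally across shards.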
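
To see why index-wide statistics give the expected ordering, here is a minimal sketch of what scoring looks like once term frequencies are gathered from all shards, which is roughly what the dfs phase provides. It assumes the BM25 idf formula and assumes each matching tag contributes exactly its idf; the names `bm25Idf`, `globalIdf` and `ranked` are hypothetical, and this is a simulation, not the client or server code.

```javascript
// BM25 idf as computed by Lucene's BM25Similarity (assumed to be the similarity in use).
function bm25Idf(docFreq, docCount) {
  return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
}

// The seven documents from the question.
const docs = {
  1: ['plop', 'plip', 'plup'],
  2: ['plop', 'plup'],
  3: ['plop'],
  4: ['plap', 'plep'],
  5: ['plop', 'plip', 'plup'],
  6: ['plup', 'plip'],
  7: ['plop', 'plip'],
};

const terms = ['plop', 'plip'];
const ids = Object.keys(docs);

// One idf per term, computed over the whole index instead of per shard.
const globalIdf = {};
for (const term of terms) {
  const docFreq = ids.filter((id) => docs[id].includes(term)).length;
  globalIdf[term] = bm25Idf(docFreq, ids.length);
}

// Assumption: the term-frequency factor is 1, so each match adds its idf.
const ranked = ids
  .map((id) => ({
    id,
    score: terms
      .filter((term) => docs[id].includes(term))
      .reduce((sum, term) => sum + globalIdf[term], 0),
  }))
  .filter((hit) => hit.score > 0)
  .sort((a, b) => b.score - a.score);

console.log(ranked);
```

With shared statistics, the documents that match both tags (ids 1, 5 and 7) get identical scores and rank above those that match only one.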

Elasticsearch simple term query gives weird scores
