为什么弹性搜索返回错误的相关性分数?

时间:2019-01-19 17:44:49

标签: elasticsearch elastic-stack

我正在学习弹性搜索,我在 megacorp 索引中插入了类型为 employee 的以下数据:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.6931472,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 0.6931472,
        "_source" : {
          "first_name" : "Jane",
          "last_name" : "Smith",
          "age" : 32,
          "about" : "I like to collect rock albums",
          "interests" : [
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      }
    ]
  }
}

然后我运行了以下请求:

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}

但是我得到的结果如下:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.6682933,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 0.6682933,
        "_source" : {
          "first_name" : "Jane",
          "last_name" : "Smith",
          "age" : 32,
          "about" : "I like to collect rock albums",
          "interests" : [
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.5753642,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      }
    ]
  }
}

我怀疑以下记录的相关性得分:

{
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.5753642,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      }

小于前一个。我用

运行查询
  

解释:是

并得到以下结果:

        {
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.6682933,
    "hits" : [
      {
        "_shard" : "[megacorp][2]",
        "_node" : "pGtCz_FvSTmteJwQKvn_lg",
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 0.6682933,
        "_source" : {
          "first_name" : "Jane",
          "last_name" : "Smith",
          "age" : 32,
          "about" : "I like to collect rock albums",
          "interests" : [
            "music"
          ],
          "fielddata" : true
        },
        "_explanation" : {
          "value" : 0.6682933,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.6682933,
              "description" : "weight(about:rock in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.6682933,
                  "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                  "details" : [
                    {
                      "value" : 0.6931472,
                      "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "docFreq",
                          "details" : [ ]
                        },
                        {
                          "value" : 2.0,
                          "description" : "docCount",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.96414346,
                      "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "termFreq=1.0",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "parameter k1",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "parameter b",
                          "details" : [ ]
                        },
                        {
                          "value" : 5.5,
                          "description" : "avgFieldLength",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.0,
                          "description" : "fieldLength",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[megacorp][3]",
        "_node" : "pGtCz_FvSTmteJwQKvn_lg",
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.5753642,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ],
          "fielddata" : true
        },
        "_explanation" : {
          "value" : 0.5753642,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.2876821,
              "description" : "weight(about:rock in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.2876821,
                  "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                  "details" : [
                    {
                      "value" : 0.2876821,
                      "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "docFreq",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.0,
                          "description" : "docCount",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "termFreq=1.0",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "parameter k1",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "parameter b",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.0,
                          "description" : "avgFieldLength",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.0,
                          "description" : "fieldLength",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 0.2876821,
              "description" : "weight(about:climbing in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.2876821,
                  "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                  "details" : [
                    {
                      "value" : 0.2876821,
                      "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "docFreq",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.0,
                          "description" : "docCount",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "termFreq=1.0",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "parameter k1",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "parameter b",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.0,
                          "description" : "avgFieldLength",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.0,
                          "description" : "fieldLength",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

您能告诉我这是什么原因吗?

1 个答案:

答案 0 :(得分:2)

简短的回答:Elasticsearch中的相关性不是一个简单的话题:)下面的详细信息。

我正试图重提您的案子...

首先我放了两个文件:

POST /megacorp/employee/1
{
  "first_name": "John",
  "last_name": "Smith",
  "age": 25,
  "about": "I love to go rock climbing",
  "interests": [
    "sports",
    "music"
  ]
}

POST /megacorp/employee/2
{
  "first_name": "Jane",
  "last_name": "Smith",
  "age": 32,
  "about": "I like to collect rock albums",
  "interests": [
    "music"
  ]
}

后来我使用了您的查询:

GET /megacorp/employee/_search
{
  "query": {
    "match": {
      "about": "rock climbing"
    }
  }
}

我的结果完全不同:

{
  "took": 89,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "first_name": "Jane",
          "last_name": "Smith",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [
            "music"
          ]
        }
      }
    ]
  }
}

如您所见,结果按“预期”顺序排列。请注意,_score值与您完全不同。

问题是:为什么?发生什么事了?

Practical BM25 - Part 1: How Shards Affect Relevance Scoring in Elasticsearch文章中介绍了针对这种情况的详细答案。

简短地说:您可能会注意到,Elasticsearch将文档存储在各个分片中。为了更快,默认情况下它使用query_then_fetch策略。这意味着Elasticsearch首先在每个分片上询问结果,然后将结果取回并呈现给用户。当然,分数计算也会发生同样的情况。

如您所见,在我们的结果中查询了5个分片。如果未在创建索引时指定,则Elasticsearch默认使用5个分片(可以使用number_of_shards参数来指定)。这就是为什么我们的分数不同的原因。此外,如果您自己尝试再次执行此操作,则很有可能再次获得不同的结果。一切都取决于文档在分片之间的分配方式。如果您将此索引的number_of_shards设置为1,则每次都会得到相同的分数。

article中还提到了另一件事:

  

人们开始只将一些文档加载到索引中并询问   “为什么文档A的得分比文档B高/低”和   有时答案是用户拥有相对较高的   分片到文档,以便分数在不同的位置倾斜   碎片。

Elasticsearch旨在维护大量数据,并且您将更多数据放入索引中,您获得的结果越准确。

希望我的回答能解释您的疑问。