Elasticsearch score explanation differs from the practical scoring function

Time: 2017-02-28 14:16:11

Tags: elasticsearch lucene

According to https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html, the score is calculated with the following function:

score(q,d)  =  
            queryNorm(q)  
          · coord(q,d)    
          · ∑ (           
                tf(t in d)   
              · idf(t)²      
              · t.getBoost() 
              · norm(t,d)    
            ) (t in q) 

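However, the explain output for the example below seems inconsistent with this formula.

1) The explanation shows only idf, not idf².

2) Where is the coordination factor?

3) Judging from the explanation, the score seems to be calculated as: (tf * idf * fieldNorm) + (number of clauses * boost * queryNorm)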

Indexed document:

PUT test/type/1
{
  "text": "a b c"
}

Query:

GET test/type/_search
{
  "explain":"true",
  "query": {
    "match": {
      "text": "a"
    }
  }
}

Result:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.15342641,
    "hits": [
      {
        "_shard": 3,
        "_node": "5QvbXVlRSku-p_g81ZXpjQ",
        "_index": "test",
        "_type": "type",
        "_id": "1",
        "_score": 0.15342641,
        "_source": {
          "text": "a b c"
        },
        "_explanation": {
          "value": 0.15342641,
          "description": "sum of:",
          "details": [
            {
              "value": 0.15342641,
              "description": "weight(text:a in 0) [PerFieldSimilarity], result of:",
              "details": [
                {
                  "value": 0.15342641,
                  "description": "fieldWeight in 0, product of:",
                  "details": [
                    {
                      "value": 1,
                      "description": "tf(freq=1.0), with freq of:",
                      "details": [
                        {
                          "value": 1,
                          "description": "termFreq=1.0",
                          "details": []
                        }
                      ]
                    },
                    {
                      "value": 0.30685282,
                      "description": "idf(docFreq=1, maxDocs=1)",
                      "details": []
                    },
                    {
                      "value": 0.5,
                      "description": "fieldNorm(doc=0)",
                      "details": []
                    }
                  ]
                }
              ]
            },
            {
              "value": 0,
              "description": "match on required clause, product of:",
              "details": [
                {
                  "value": 0,
                  "description": "# clause",
                  "details": []
                },
                {
                  "value": 3.2588913,
                  "description": "_type:type, product of:",
                  "details": [
                    {
                      "value": 1,
                      "description": "boost",
                      "details": []
                    },
                    {
                      "value": 3.2588913,
                      "description": "queryNorm",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

1 answer:

Answer 0: (score: 0)

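  • You are missing one factor of idf because your query has only one clause. The second multiplication by idf comes from the query weight, and you won't see it in a query this simple. The second idf is cancelled out by queryNorm. queryNorm (simplifying a bit) is 1 / √(∑ idf²), which for a single term becomes 1 / idf, so the query weight becomes idf / idf = 1. All of this is implicit: with only one clause there is nothing to weigh the term against, so the query weight doesn't need to be calculated.

  • There is only one term in this query, so coord has no effect. That is, coord = overlap / maxOverlap = 1/1 = 1.

  • Not sure where you got this from. I believe you are being thrown off a bit by the _type query. It appears to be a required term added to restrict the search to a specific Elasticsearch type. Note that this clause has a score of zero. So every match must have the specified _type, but the term shouldn't affect the score at all.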
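
The cancellation described in the first point can be checked numerically. The following is only a sketch: it assumes Lucene's classic TF-IDF formula idf = 1 + ln(maxDocs / (docFreq + 1)), uses the simplified queryNorm from the answer with all boosts equal to 1, and the function names are made up for illustration.

```python
import math

def classic_idf(doc_freq, max_docs):
    # Assumed classic TF-IDF formula: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def query_norm(term_idfs):
    # Simplified queryNorm from the answer: 1 / sqrt(sum of idf^2),
    # assuming every term boost is 1
    return 1.0 / math.sqrt(sum(idf * idf for idf in term_idfs))

# Values from the explain output: docFreq=1, maxDocs=1
idf = classic_idf(doc_freq=1, max_docs=1)
norm = query_norm([idf])          # single term, so this is 1 / idf
query_weight = idf * norm         # idf / idf

print("idf          =", idf)
print("queryNorm    =", norm)
print("query weight =", query_weight)
```

With a single term the computed queryNorm comes out at roughly 3.2588913, the same value the explain output reports for queryNorm, and the query weight collapses to 1, which is why only one idf remains visible.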

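If you want to see everything the scoring algorithm does, you need a test data set and queries closer to real-world conditions. This test has a single simple document and a single simple query. In that case, yes, the algorithm looks very simple:

score = tf * idf * fieldNorm = 1 * 0.30685282 * 0.5 = 0.15342641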

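But you are not seeing coord, queryNorm, or the overall query weight calculation because your query is too simple. You are not seeing a particularly meaningful idf (or tf) because there is only one document and one match. You are not seeing a sum because you have one hit on one term, so there is nothing to sum. The algorithm is mainly designed to produce meaningful scores from larger data sets.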
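
The fieldWeight arithmetic from the explain output can be reproduced with a few lines. This is a sketch under stated assumptions: tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)) as in Lucene's classic TF-IDF similarity, while the fieldNorm of 0.5 is copied directly from the explain output rather than derived. The helper names are hypothetical.

```python
import math

def classic_tf(freq):
    # Assumed classic TF-IDF term frequency: sqrt(freq)
    return math.sqrt(freq)

def classic_idf(doc_freq, max_docs):
    # Assumed classic TF-IDF formula: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))

# fieldNorm is taken verbatim from the explain output, not recomputed here
FIELD_NORM = 0.5

tf = classic_tf(1.0)                       # termFreq=1.0
idf = classic_idf(doc_freq=1, max_docs=1)  # idf(docFreq=1, maxDocs=1)
score = tf * idf * FIELD_NORM

print("tf    =", tf)
print("idf   =", idf)
print("score =", score)
```

The result agrees with the reported _score of 0.15342641 to the displayed precision, so the only factors contributing here are tf, a single idf, and fieldNorm.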