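Elasticsearch gives different scores for identical documents

Time: 2013-01-29 10:28:39

Tags: elasticsearch

I have some documents with the same content, but when I query them I get different scores even though the queried field contains the same text. I ran the query with explain enabled, but I cannot work out from the output why the scores differ.

My query is: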

 curl 'localhost:9200/acqindex/_search?pretty=1' -d '{
    "explain" : true,
    "query" : {           
        "query_string" : {         
            "query" : "text:shimla"
        }
    }     
  }'

Search response:

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 31208,
    "max_score" : 268.85962,
    "hits" : [ {
      "_shard" : 0,
      "_node" : "KOebAnGhSJKUHLPNxndcpQ",
      "_index" : "acqindex",
      "_type" : "autocomplete_questions",
      "_id" : "50efec6c38cc6fdabd8653a3",
      "_score" : 268.85962, "_source" : {"_class":"com.ixigo.next.cms.model.AutoCompleteObject","_id":"50efec6c38cc6fdabd8653a3","ad":"rajasthan,IN","category":["Destination"],"ctype":"destination","eid":"503b2a65e4b032e338f0d24b","po":8.772307692307692,"text":"shimla","url":"/travel-guide/shimla"},
      "_explanation" : {
        "value" : 268.85962,
        "description" : "sum of:",
        "details" : [ {
          "value" : 38.438133,
          "description" : "weight(text:shi in 5860), product of:",
          "details" : [ {
            "value" : 0.37811017,
            "description" : "queryWeight(text:shi), product of:",
            "details" : [ {
              "value" : 5.0829277,
              "description" : "idf(docFreq=7503, maxDocs=445129)"
            }, {
              "value" : 0.074388266,
              "description" : "queryNorm"
            } ]
          }, {
            "value" : 101.658554,
            "description" : "fieldWeight(text:shi in 5860), product of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "tf(termFreq(text:shi)=1)"
            }, {
              "value" : 5.0829277,
              "description" : "idf(docFreq=7503, maxDocs=445129)"
            }, {
              "value" : 20.0,
              "description" : "fieldNorm(field=text, doc=5860)"
            } ]
          } ]
        }, {
          "value" : 66.8446,
          "description" : "weight(text:shim in 5860), product of:",
          "details" : [ {
            "value" : 0.49862078,
            "description" : "queryWeight(text:shim), product of:",
            "details" : [ {
              "value" : 6.7029495,
              "description" : "idf(docFreq=1484, maxDocs=445129)"
            }, {
              "value" : 0.074388266,
              "description" : "queryNorm"
            } ]
          }, {
            "value" : 134.05899,
            "description" : "fieldWeight(text:shim in 5860), product of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "tf(termFreq(text:shim)=1)"
            }, {
              "value" : 6.7029495,
              "description" : "idf(docFreq=1484, maxDocs=445129)"
            }, {
              "value" : 20.0,
              "description" : "fieldNorm(field=text, doc=5860)"
            } ]
          } ]
        }, {
          "value" : 81.75818,
          "description" : "weight(text:shiml in 5860), product of:",
          "details" : [ {
            "value" : 0.5514458,
            "description" : "queryWeight(text:shiml), product of:",
            "details" : [ {
              "value" : 7.413075,
              "description" : "idf(docFreq=729, maxDocs=445129)"
            }, {
              "value" : 0.074388266,
              "description" : "queryNorm"
            } ]
          }, {
            "value" : 148.2615,
            "description" : "fieldWeight(text:shiml in 5860), product of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "tf(termFreq(text:shiml)=1)"
            }, {
              "value" : 7.413075,
              "description" : "idf(docFreq=729, maxDocs=445129)"
            }, {
              "value" : 20.0,
              "description" : "fieldNorm(field=text, doc=5860)"
            } ]
          } ]
        }, {
          "value" : 81.8187,
          "description" : "weight(text:shimla in 5860), product of:",
          "details" : [ {
            "value" : 0.55164987,
            "description" : "queryWeight(text:shimla), product of:",
            "details" : [ {
              "value" : 7.415818,
              "description" : "idf(docFreq=727, maxDocs=445129)"
            }, {
              "value" : 0.074388266,
              "description" : "queryNorm"
            } ]
          }, {
            "value" : 148.31636,
            "description" : "fieldWeight(text:shimla in 5860), product of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "tf(termFreq(text:shimla)=1)"
            }, {
              "value" : 7.415818,
              "description" : "idf(docFreq=727, maxDocs=445129)"
            }, {
              "value" : 20.0,
              "description" : "fieldNorm(field=text, doc=5860)"
            } ]
          } ]
        } ]
      }
    }, {
      "_shard" : 1,
      "_node" : "KOebAnGhSJKUHLPNxndcpQ",
      "_index" : "acqindex",
      "_type" : "autocomplete_questions",
      "_id" : "50efed1c38cc6fdabd8b8d2f",
      "_score" : 268.29953, "_source" : {"_id":"50efed1c38cc6fdabd8b8d2f","ad":"himachal pradesh,IN","category":["Hill","See and Do","Destination","Mountain","Nature and Wildlife"],"ctype":"destination","eid":"503b2a64e4b032e338f0d0af","po":8.781970310391364,"text":"shimla","url":"/travel-guide/shimla"},
      "_explanation" : {
        "value" : 268.29953,
        "description" : "sum of:",
        "details" : [ {
          "value" : 38.52957,
          "description" : "weight(text:shi in 14769), product of:",
          "details" : [ {
            "value" : 0.37895453,
            "description" : "queryWeight(text:shi), product of:",
            "details" : [ {
              "value" : 5.083667,
              "description" : "idf(docFreq=7263, maxDocs=431211)"
            }, {
              "value" : 0.07454354,
              "description" : "queryNorm"
            } ]
          }, {
            "value" : 101.67334,
            "description" : "fieldWeight(text:shi in 14769), product of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "tf(termFreq(text:shi)=1)"
            }, {
              "value" : 5.083667,
              "description" : "idf(docFreq=7263, maxDocs=431211)"
            }, {
              "value" : 20.0,
              "description" : "fieldNorm(field=text, doc=14769)"
            } ]
          } ]
        }, {
          "value" : 66.67524,
          "description" : "weight(text:shim in 14769), product of:",
          "details" : [ {
            "value" : 0.49850821,
            "description" : "queryWeight(text:shim), product of:",
            "details" : [ {
              "value" : 6.6874766,
              "description" : "idf(docFreq=1460, maxDocs=431211)"
            }, {
              "value" : 0.07454354,
              "description" : "queryNorm"
            } ]
          }, {
            "value" : 133.74953,
            "description" : "fieldWeight(text:shim in 14769), product of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "tf(termFreq(text:shim)=1)"
            }, {
              "value" : 6.6874766,
              "description" : "idf(docFreq=1460, maxDocs=431211)"
            }, {
              "value" : 20.0,
              "description" : "fieldNorm(field=text, doc=14769)"
            } ]
          } ]
        }, {
          "value" : 81.53204,
          "description" : "weight(text:shiml in 14769), product of:",
          "details" : [ {
            "value" : 0.5512571,
            "description" : "queryWeight(text:shiml), product of:",
            "details" : [ {
              "value" : 7.3951015,
              "description" : "idf(docFreq=719, maxDocs=431211)"
            }, {
              "value" : 0.07454354,
              "description" : "queryNorm"
            } ]
          }, {
            "value" : 147.90204,
            "description" : "fieldWeight(text:shiml in 14769), product of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "tf(termFreq(text:shiml)=1)"
            }, {
              "value" : 7.3951015,
              "description" : "idf(docFreq=719, maxDocs=431211)"
            }, {
              "value" : 20.0,
              "description" : "fieldNorm(field=text, doc=14769)"
            } ]
          } ]
        }, {
          "value" : 81.56268,
          "description" : "weight(text:shimla in 14769), product of:",
          "details" : [ {
            "value" : 0.55136067,
            "description" : "queryWeight(text:shimla), product of:",
            "details" : [ {
              "value" : 7.3964915,
              "description" : "idf(docFreq=718, maxDocs=431211)"
            }, {
              "value" : 0.07454354,
              "description" : "queryNorm"
            } ]
          }, {
            "value" : 147.92982,
            "description" : "fieldWeight(text:shimla in 14769), product of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "tf(termFreq(text:shimla)=1)"
            }, {
              "value" : 7.3964915,
              "description" : "idf(docFreq=718, maxDocs=431211)"
            }, {
              "value" : 20.0,
              "description" : "fieldNorm(field=text, doc=14769)"
            } ]
          } ]
        } ]
      }
    } ]
  }
}

The documents are:

{"_class":"com.ixigo.next.cms.model.AutoCompleteObject","_id":"50efec6c38cc6fdabd8653a3","ad":"rajasthan,IN","category":["Destination"],"ctype":"destination","eid":"503b2a65e4b032e338f0d24b","po":8.772307692307692,"text":"shimla","url":"/travel-guide/shimla"}

{"_id":"50efed1c38cc6fdabd8b8d2f","ad":"himachal pradesh,IN","category":["Hill","See and Do","Destination","Mountain","Nature and Wildlife"],"ctype":"destination","eid":"503b2a64e4b032e338f0d0af","po":8.781970310391364,"text":"shimla","url":"/travel-guide/shimla"}

Please help me understand the reason for the difference in scores.

2 Answers:

Answer 0: (score: 29)

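The Lucene score depends on different factors. With the TF-IDF similarity (the default one), it mainly depends on:

  1. Term frequency: how often the matched terms occur within the document.
  2. Inverse document frequency: how common the matched terms are across the documents in the index; rarer terms score higher.
  3. Field norms (including index-time boosting). Shorter fields score higher than longer fields.

In your case, you have to take into account that your two documents come from different shards, so the score is computed separately on each shard, since every shard is in fact a separate Lucene index. That is why the explain output shows different docFreq and maxDocs values (and therefore different idf and queryNorm values) for the same term in the two hits.

You might want to look at the more expensive DFS, Query then Fetch search type that Elasticsearch provides for more accurate scoring. The default is the simple Query then Fetch.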
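To make the per-shard effect concrete, here is a minimal sketch that recomputes the weight of the term text:shi for both hits from the numbers in the explain output. It assumes the classic Lucene TF-IDF formula idf = 1 + ln(maxDocs / (docFreq + 1)), which matches the idf values shown above; the function names classic_idf and term_weight are made up for this illustration and are not part of any Elasticsearch or Lucene API.

```python
import math


def classic_idf(doc_freq, max_docs):
    # Assumed classic Lucene TF-IDF formula: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_weight(doc_freq, max_docs, query_norm, tf=1.0, field_norm=20.0):
    # Mirrors the explain tree: weight = queryWeight * fieldWeight.
    idf = classic_idf(doc_freq, max_docs)
    query_weight = idf * query_norm          # queryWeight = idf * queryNorm
    field_weight = tf * idf * field_norm     # fieldWeight = tf * idf * fieldNorm
    return query_weight * field_weight


# docFreq, maxDocs and queryNorm for text:shi, copied from the explain output.
shard0_weight = term_weight(doc_freq=7503, max_docs=445129, query_norm=0.074388266)
shard1_weight = term_weight(doc_freq=7263, max_docs=431211, query_norm=0.07454354)

print("shard 0:", shard0_weight)
print("shard 1:", shard1_weight)
```

The two results should come out close to the 38.438133 and 38.52957 reported for text:shi. The tf and fieldNorm inputs are identical for both documents, so the whole difference comes from the shard-local docFreq, maxDocs and queryNorm.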

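Answer 1: (score: 0)

javanna clearly pointed out the cause: the scores differ because scoring happens separately on multiple shards. Those shards may hold different numbers of documents, which affects the scoring algorithm.

However, the authors of Elasticsearch: The Definitive Guide note: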

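The differences between local and global IDF [inverse document frequency] diminish the more documents that you add to the index. With real-world volumes of data, the local IDFs soon even out. The problem is not that relevance is broken but that there is too little data.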

You should not use dfs_query_then_fetch in production. For testing, either put your index on a single primary shard or specify ?search_type=dfs_query_then_fetch
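The guide's point that local IDFs even out as data grows can be illustrated with a small simulation. This is only a sketch under stated assumptions: each shard is modeled as a set of documents that contain the term independently with a fixed probability (term_rate, a made-up parameter), and the classic Lucene formula idf = 1 + ln(maxDocs / (docFreq + 1)) is assumed. All function names are hypothetical.

```python
import math
import random


def classic_idf(doc_freq, max_docs):
    # Assumed classic Lucene TF-IDF formula: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def simulated_shard_idf(num_docs, term_rate, rng):
    # Each document contains the term independently with probability term_rate.
    doc_freq = sum(rng.random() < term_rate for _ in range(num_docs))
    return classic_idf(doc_freq, num_docs)


def mean_idf_gap(num_docs, term_rate=0.05, trials=10, seed=0):
    # Average absolute IDF difference between two equally sized shards.
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        idf_a = simulated_shard_idf(num_docs, term_rate, rng)
        idf_b = simulated_shard_idf(num_docs, term_rate, rng)
        gaps.append(abs(idf_a - idf_b))
    return sum(gaps) / trials


small_gap = mean_idf_gap(num_docs=200)
large_gap = mean_idf_gap(num_docs=50_000)

print("mean IDF gap with 200 docs per shard:   ", small_gap)
print("mean IDF gap with 50,000 docs per shard:", large_gap)
```

Under these assumptions the gap between the two shards' IDF values shrinks sharply as the shards get larger, which is consistent with the shards in the question (over 400,000 documents each) differing only in the third decimal place of idf.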