From https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html we have the following function for computing the score:
score(q,d) =
queryNorm(q)
· coord(q,d)
· ∑ (
tf(t in d)
· idf(t)²
· t.getBoost()
· norm(t,d)
) (t in q)
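However, looking at the example below, the explain output seems to be inconsistent with this formula.
1) The explanation shows only idf, not idf².
2) Where is the coordination factor?
3) From the explanation, the score appears to be computed as: (tf * idf * fieldNorm) + (number of clauses * boost * queryNorm)
Index a document: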
PUT test/type/1
{
"text": "a b c"
}
Query:
GET test/type/_search
{
"explain":"true",
"query": {
"match": {
"text": "a"
}
}
}
Result:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.15342641,
"hits": [
{
"_shard": 3,
"_node": "5QvbXVlRSku-p_g81ZXpjQ",
"_index": "test",
"_type": "type",
"_id": "1",
"_score": 0.15342641,
"_source": {
"text": "a b c"
},
"_explanation": {
"value": 0.15342641,
"description": "sum of:",
"details": [
{
"value": 0.15342641,
"description": "weight(text:a in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.15342641,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
}
]
},
{
"value": 0.30685282,
"description": "idf(docFreq=1, maxDocs=1)",
"details": []
},
{
"value": 0.5,
"description": "fieldNorm(doc=0)",
"details": []
}
]
}
]
},
{
"value": 0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0,
"description": "# clause",
"details": []
},
{
"value": 3.2588913,
"description": "_type:type, product of:",
"details": [
{
"value": 1,
"description": "boost",
"details": []
},
{
"value": 3.2588913,
"description": "queryNorm",
"details": []
}
]
}
]
}
]
}
}
]
}
}
Answer 0 (score: 0)
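You're missing one occurrence of idf because your query has only one clause. The second multiplication by idf comes from the query weight, which you won't see in a query this simple, because the second idf is cancelled out by queryNorm. queryNorm (simplifying a bit) is 1 / √(∑ idf²), which for a single term becomes 1 / idf, so the query weight becomes idf / idf = 1. All of this is implicit: with only one clause there is nothing to weigh the term against, so the query weight doesn't need to be computed.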
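To make the cancellation concrete, here is a minimal Python sketch of the simplified queryNorm described above. The function name `query_norm` is made up for illustration, and the idf value is copied from the explain output in the question. It suggests that the 3.2588913 reported as queryNorm in that output is just 1 / idf.

```python
import math

def query_norm(idfs):
    """Simplified queryNorm: 1 / sqrt(sum of squared idf values).

    Hypothetical helper for illustration; boosts are assumed to be 1.
    """
    return 1.0 / math.sqrt(sum(idf * idf for idf in idfs))

# idf(docFreq=1, maxDocs=1) as reported in the explain output
idf = 0.30685282

norm = query_norm([idf])      # single term, so this is 1 / idf
query_weight = idf * norm     # idf / idf

print(norm)          # close to the queryNorm value in the explain output
print(query_weight)  # effectively 1, so the second idf drops out
```

With more than one term in the query the sum has several entries, the norm is no longer 1 / idf, and the query weight becomes visible in the explanation.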
There is only one term in this query, so there is nothing to coordinate. That is, coord = overlap / maxOverlap = 1/1 = 1.
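As a rough sketch of that ratio in Python (the `coord` function and the three-term query are hypothetical, not taken from the question):

```python
def coord(overlap, max_overlap):
    """Fraction of the query's terms that the document matched."""
    return overlap / max_overlap

# The query from the question: one term, and it matches.
single_term = coord(1, 1)

# Hypothetical three-term query where the document matches two terms.
partial_match = coord(2, 3)

print(single_term)
print(partial_match)
```

Only in the second case does the factor change the score, which is why it does not show up in a one-term query.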
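I'm not sure where the formula in your third point comes from. I believe you're being thrown off a bit by the _type query. It appears to be a required clause that Elasticsearch adds to restrict the search to a specific type. Note that this clause has a score of zero. So every match must belong to the specified _type, but that clause does not affect the score at all.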
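If you want to see everything at work in the scoring algorithm, you need a test dataset and queries that are closer to real conditions. This test has a single simple document and a single simple query. In that case, yes, the algorithm looks very simple:
score = tf * idf * fieldNorm = 1 * 0.30685282 * 0.5 = 0.15342641
But you aren't seeing coord, queryNorm, or the overall query weight calculation, because your query is too simple. You aren't seeing a particularly meaningful idf (or tf), because there is only one document and one match. You aren't seeing a sum, because you have a single hit on a single term, so there is nothing to sum. The algorithm is mainly meant to produce meaningful scores over larger datasets.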
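The numbers in the explain output can be reproduced with a short Python sketch, assuming Lucene's classic TF-IDF similarity, where tf = √freq and idf = 1 + ln(maxDocs / (docFreq + 1)). The function names are illustrative, and fieldNorm is taken directly from the explain output rather than recomputed, since the stored norm is a lossy encoded value.

```python
import math

def tf(freq):
    """Classic TF-IDF term frequency: square root of the raw frequency."""
    return math.sqrt(freq)

def idf(doc_freq, max_docs):
    """Classic TF-IDF inverse document frequency."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

# Values reported in the explain output for text:a in document 1
term_freq = 1.0
doc_freq = 1
max_docs = 1
field_norm = 0.5  # copied from fieldNorm(doc=0), not recomputed here

score = tf(term_freq) * idf(doc_freq, max_docs) * field_norm

print(idf(doc_freq, max_docs))  # roughly 0.30685282
print(score)                    # roughly 0.15342641
```

Because docFreq and maxDocs are both 1, idf comes out below 1 here, which is another sign that a one-document index does not give a meaningful idf.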