我正在学习弹性搜索,我在 megacorp 索引中插入了类型为 employee 的以下数据:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.6931472,
"hits" : [
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "2",
"_score" : 0.6931472,
"_source" : {
"first_name" : "Jane",
"last_name" : "Smith",
"age" : 32,
"about" : "I like to collect rock albums",
"interests" : [
"music"
]
}
},
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests" : [
"sports",
"music"
]
}
}
]
}
}
然后我运行了以下请求:
GET /megacorp/employee/_search
{
"query" : {
"match" : {
"about" : "rock climbing"
}
}
}
但是我得到的结果如下:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.6682933,
"hits" : [
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "2",
"_score" : 0.6682933,
"_source" : {
"first_name" : "Jane",
"last_name" : "Smith",
"age" : 32,
"about" : "I like to collect rock albums",
"interests" : [
"music"
]
}
},
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests" : [
"sports",
"music"
]
}
}
]
}
}
我怀疑以下记录的相关性得分:
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests" : [
"sports",
"music"
]
}
}
小于前一个。我用
运行查询解释:是
并得到以下结果:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.6682933,
"hits" : [
{
"_shard" : "[megacorp][2]",
"_node" : "pGtCz_FvSTmteJwQKvn_lg",
"_index" : "megacorp",
"_type" : "employee",
"_id" : "2",
"_score" : 0.6682933,
"_source" : {
"first_name" : "Jane",
"last_name" : "Smith",
"age" : 32,
"about" : "I like to collect rock albums",
"interests" : [
"music"
],
"fielddata" : true
},
"_explanation" : {
"value" : 0.6682933,
"description" : "sum of:",
"details" : [
{
"value" : 0.6682933,
"description" : "weight(about:rock in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.6682933,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [
{
"value" : 0.6931472,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 1.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 0.96414346,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details" : [
{
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "parameter b",
"details" : [ ]
},
{
"value" : 5.5,
"description" : "avgFieldLength",
"details" : [ ]
},
{
"value" : 6.0,
"description" : "fieldLength",
"details" : [ ]
}
]
}
]
}
]
}
]
}
},
{
"_shard" : "[megacorp][3]",
"_node" : "pGtCz_FvSTmteJwQKvn_lg",
"_index" : "megacorp",
"_type" : "employee",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests" : [
"sports",
"music"
],
"fielddata" : true
},
"_explanation" : {
"value" : 0.5753642,
"description" : "sum of:",
"details" : [
{
"value" : 0.2876821,
"description" : "weight(about:rock in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.2876821,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 1.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 1.0,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details" : [
{
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "parameter b",
"details" : [ ]
},
{
"value" : 6.0,
"description" : "avgFieldLength",
"details" : [ ]
},
{
"value" : 6.0,
"description" : "fieldLength",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 0.2876821,
"description" : "weight(about:climbing in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.2876821,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 1.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 1.0,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details" : [
{
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "parameter b",
"details" : [ ]
},
{
"value" : 6.0,
"description" : "avgFieldLength",
"details" : [ ]
},
{
"value" : 6.0,
"description" : "fieldLength",
"details" : [ ]
}
]
}
]
}
]
}
]
}
}
]
}
}
您能告诉我这是什么原因吗?
答案 0 :(得分:2)
简短的回答:Elasticsearch中的相关性不是一个简单的话题:)下面的详细信息。
我正试图重提您的案子...
首先我放了两个文件:
POST /megacorp/employee/1
{
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [
"sports",
"music"
]
}
POST /megacorp/employee/2
{
"first_name": "Jane",
"last_name": "Smith",
"age": 32,
"about": "I like to collect rock albums",
"interests": [
"music"
]
}
后来我使用了您的查询:
GET /megacorp/employee/_search
{
"query": {
"match": {
"about": "rock climbing"
}
}
}
我的结果完全不同:
{
"took": 89,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.5753642,
"hits": [
{
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_score": 0.5753642,
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [
"sports",
"music"
]
}
},
{
"_index": "megacorp",
"_type": "employee",
"_id": "2",
"_score": 0.2876821,
"_source": {
"first_name": "Jane",
"last_name": "Smith",
"age": 32,
"about": "I like to collect rock albums",
"interests": [
"music"
]
}
}
]
}
}
如您所见,结果按“预期”顺序排列。请注意,_score
值与您完全不同。
问题是:为什么?发生什么事了?
Practical BM25 - Part 1: How Shards Affect Relevance Scoring in Elasticsearch文章中介绍了针对这种情况的详细答案。
简短地说:您可能会注意到,Elasticsearch将文档存储在各个分片中。为了更快,默认情况下它使用query_then_fetch
策略。这意味着Elasticsearch首先在每个分片上询问结果,然后将结果取回并呈现给用户。当然,分数计算也会发生同样的情况。
如您所见,在我们的结果中查询了5个分片。如果未在创建索引时指定,则Elasticsearch默认使用5个分片(可以使用number_of_shards
参数来指定)。这就是为什么我们的分数不同的原因。此外,如果您自己尝试再次执行此操作,则很有可能再次获得不同的结果。一切都取决于文档在分片之间的分配方式。如果您将此索引的number_of_shards
设置为1,则每次都会得到相同的分数。
article中还提到了另一件事:
人们开始只将一些文档加载到索引中并询问 “为什么文档A的得分比文档B高/低”和 有时答案是用户拥有相对较高的 分片到文档,以便分数在不同的位置倾斜 碎片。
Elasticsearch旨在维护大量数据,并且您将更多数据放入索引中,您获得的结果越准确。
希望我的回答能解释您的疑问。