Elasticsearch version: 5.0.2
I populated my index with:
{_id: 1, tags: ['plop', 'plip', 'plup']},
{_id: 2, tags: ['plop', 'plup']},
{_id: 3, tags: ['plop']},
{_id: 4, tags: ['plap', 'plep']},
{_id: 5, tags: ['plop', 'plip', 'plup']},
{_id: 6, tags: ['plup', 'plip']},
{_id: 7, tags: ['plop', 'plip']}
Then I want to retrieve the documents most relevant to the tags plop and plip:
query: {
bool: {
should: [
{term: {tags: {value:'plop', _name: 'plop'}}},
{term: {tags: {value:'plip', _name: 'plip'}}}
]
}
}
This is equivalent to the following (but I use the first form for debugging):
query: {
bool: {
should: [
{terms: {tags: ['plop', 'plip']}}
]
}
}
Then I get very strange scores:
[
{ id: '2', score: 0.88002616, tags: [ 'plop', 'plup' ] },
{ id: '6', score: 0.88002616, tags: [ 'plup', 'plip' ] },
{ id: '5', score: 0.5063205, tags: [ 'plop', 'plip', 'plup' ] },
{ id: '7', score: 0.3610978, tags: [ 'plop', 'plip' ] },
{ id: '1', score: 0.29277915, tags: [ 'plop', 'plip', 'plup' ] },
{ id: '3', score: 0.2876821, tags: [ 'plop' ] }
]
Here is the full response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 0.88002616,
"hits": [
{
"_index": "myindex",
"_type": "mytype",
"_id": "2",
"_score": 0.88002616,
"_source": {
"tags": [
"plop",
"plup"
]
},
"matched_queries": [
"plop"
]
},
{
"_index": "myindex",
"_type": "mytype",
"_id": "6",
"_score": 0.88002616,
"_source": {
"tags": [
"plup",
"plip"
]
},
"matched_queries": [
"plip"
]
},
{
"_index": "myindex",
"_type": "mytype",
"_id": "5",
"_score": 0.5063205,
"_source": {
"tags": [
"plop",
"plip",
"plup"
]
},
"matched_queries": [
"plop",
"plip"
]
},
{
"_index": "myindex",
"_type": "mytype",
"_id": "7",
"_score": 0.3610978,
"_source": {
"tags": [
"plop",
"plip"
]
},
"matched_queries": [
"plop",
"plip"
]
},
{
"_index": "myindex",
"_type": "mytype",
"_id": "1",
"_score": 0.29277915,
"_source": {
"tags": [
"plop",
"plip",
"plup"
]
},
"matched_queries": [
"plop",
"plip"
]
},
{
"_index": "myindex",
"_type": "mytype",
"_id": "3",
"_score": 0.2876821,
"_source": {
"tags": [
"plop"
]
},
"matched_queries": [
"plop"
]
}
]
}
}
So, two questions:
Did I miss something?
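Answer 0 (score: 1)
OK, I found your real problem. By default, Elasticsearch uses 5 shards to store your index data, and if you only have a few documents, this can matter a lot when your _score values are computed. Some theory about shards: https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html
Why does it matter? For better performance, each shard computes _score from its own data. To compute the score, Elasticsearch uses a TF/IDF-based algorithm that depends on the total number of documents and on the frequency of the search term IN THE SHARD (https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html).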
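To make the effect concrete, here is a small sketch of the IDF part of the score. It assumes the BM25 formula used by Lucene's BM25Similarity (the default similarity in Elasticsearch 5.x); the function name bm25Idf and the shard sizes are only illustrative, not something taken from your cluster.

```javascript
// BM25 inverse document frequency, assuming Lucene's BM25Similarity formula.
// docFreq  = number of documents containing the term
// docCount = number of documents that have the field at all
function bm25Idf(docFreq, docCount) {
  return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
}

// 'plop' appears in 5 of the 7 documents from the question.
const indexWideIdf = bm25Idf(5, 7);

// On a shard that happens to hold a single document containing 'plop',
// the same term is weighted with docFreq=1, docCount=1 (about 0.2877).
const singleDocShardIdf = bm25Idf(1, 1);

// On a shard holding three documents, only one of which contains 'plop',
// the term looks much rarer and gets a higher weight (about 0.9808).
const threeDocShardIdf = bm25Idf(1, 3);

console.log({indexWideIdf, singleDocShardIdf, threeDocShardIdf});
```

The same term gets three different weights depending only on which documents share a shard with it, which is why a handful of documents spread over 5 shards can produce a surprising ranking.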
To fix this, you can create your index with a single shard, like this:
{
"settings": {
"number_of_shards" : 1,
"number_of_replicas" : 0
},
"mappings": {
"my_type": {
"properties": {
"tags": {
"type": "keyword"
}
}
}
}
}
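You can verify my theory by adding ?explain to your search query.
If you want more detail, read through this example ;) These are the results of my query for ["plop", "plip"]: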
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 0.9808292,
"hits": [
{
"_index": "test",
"_type": "my_type",
"_id": "2",
"_score": 0.9808292,
"_source": {
"tags": [
"plop",
"plup"
]
}
},
{
"_index": "test",
"_type": "my_type",
"_id": "6",
"_score": 0.9808292,
"_source": {
"tags": [
"plup",
"plip"
]
}
},
{
"_index": "test",
"_type": "my_type",
"_id": "5",
"_score": 0.5753642,
"_source": {
"tags": [
"plop",
"plip",
"plup"
]
}
},
{
"_index": "test",
"_type": "my_type",
"_id": "1",
"_score": 0.36464313,
"_source": {
"tags": [
"plop",
"plip",
"plup"
]
}
},
{
"_index": "test",
"_type": "my_type",
"_id": "7",
"_score": 0.36464313,
"_source": {
"tags": [
"plop",
"plip"
]
}
},
{
"_index": "test",
"_type": "my_type",
"_id": "3",
"_score": 0.2876821,
"_source": {
"tags": [
"plop"
]
}
}
]
}
}
Why does a document with plop, plip, plup come third? Check its explanation:
"_shard": "[test][1]",
"_node": "LjGrgIa7QgiPlEvMxqKOdA",
"_index": "test",
"_type": "my_type",
"_id": "5",
"_score": 0.5753642,
"_source": {
"tags": [
"plop",
"plip",
"plup"
]
},
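This is the only doc in this shard, test[1] (I checked this against the other returned docs)! So the IDF is computed from shard-local numbers, docFreq=1 and docCount=1, not from the whole index. The score is built from TF × IDF for each matching term, and with a single doc in the shard both terms get exactly the same weight. Check how the 0.5753642 score is computed for this doc: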
"value": 0.2876821,
"description": "weight(tags:plop...
"details": [
{
"value": 0.2876821,
"description": "idf(docFreq=1, docCount=1)",
plus
{
"value": 0.2876821,
"description": "weight(tags:plip..
"value": 0.2876821,
"description": "idf(docFreq=1, docCount=1)",
"details": []
},
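As a cross-check, the sketch below recomputes the scores from shard-local statistics only. The shard layout is an assumption: it is inferred from the scores in my results, not reported by the cluster. It also assumes the BM25 IDF formula of Lucene's BM25Similarity and a term-frequency factor of 1, which is what the explain output shows for the keyword field (the weight equals the IDF). The names bm25Idf, shardLayout and scoreShard are made up for this example.

```javascript
// BM25 IDF, assuming Lucene's BM25Similarity formula (Elasticsearch 5.x default).
function bm25Idf(docFreq, docCount) {
  return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
}

// Documents from the question, keyed by _id.
const docs = {
  1: ['plop', 'plip', 'plup'],
  2: ['plop', 'plup'],
  3: ['plop'],
  4: ['plap', 'plep'],
  5: ['plop', 'plip', 'plup'],
  6: ['plup', 'plip'],
  7: ['plop', 'plip'],
};

// ASSUMPTION: one shard layout that is consistent with the returned scores.
const shardLayout = [[2, 4, 6], [1, 7], [5], [3], []];

// Score each document of a shard using only that shard's statistics.
// Every matching term contributes its shard-local IDF (TF factor assumed 1).
function scoreShard(ids, terms) {
  const scores = {};
  for (const id of ids) {
    let score = 0;
    for (const term of terms) {
      if (!docs[id].includes(term)) continue;
      const docFreq = ids.filter((other) => docs[other].includes(term)).length;
      score += bm25Idf(docFreq, ids.length);
    }
    if (score > 0) scores[id] = score; // non-matching docs are not returned
  }
  return scores;
}

const scores = Object.assign(
  {},
  ...shardLayout.map((ids) => scoreShard(ids, ['plop', 'plip']))
);
console.log(scores);
```

With this layout, docs 2 and 6 come out at about 0.9808, doc 5 at about 0.5754, docs 1 and 7 at about 0.3646 and doc 3 at about 0.2877, matching the response above. Docs 2 and 6 rank first only because their term is rare within their own shard.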
Answer 1 (score: 1)
The problem I ran into is well explained in jgr's answer.
The solution I found is to use dfs_query_then_fetch as the search type.
Here is the query, generated with the JavaScript client:
// searchType is a request parameter next to body, not part of the query body
body: {
  query: {
    bool: {
      should: [
        {terms: {tags: ['plop', 'plip']}}
      ]
    }
  }
},
searchType: 'dfs_query_then_fetch'
Note that the more data you have in your index, the less this should be needed, because the scores naturally balance out across shards.
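For completeness, here is a rough sketch of how the whole request could be assembled. buildTagSearch is a hypothetical helper, and the commented-out call assumes the legacy `elasticsearch` npm package with a node reachable at localhost:9200; adapt both to your setup. The named term queries are the ones from the question, so matched_queries still works.

```javascript
// Build the parameters for client.search(). searchType sits next to index
// and body; the client sends it as the search_type URL parameter.
// buildTagSearch is a hypothetical helper, not part of any client library.
function buildTagSearch(index, tags) {
  return {
    index,
    searchType: 'dfs_query_then_fetch',
    body: {
      query: {
        bool: {
          // One named term query per tag, as in the question.
          should: tags.map((tag) => ({term: {tags: {value: tag, _name: tag}}})),
        },
      },
    },
  };
}

const params = buildTagSearch('myindex', ['plop', 'plip']);
console.log(JSON.stringify(params, null, 2));

// ASSUMPTION: legacy `elasticsearch` package installed, node on localhost:9200.
//   const elasticsearch = require('elasticsearch');
//   const client = new elasticsearch.Client({host: 'localhost:9200'});
//   client.search(params).then((res) => console.log(res.hits.hits));
```

With dfs_query_then_fetch, term statistics are gathered from all shards before scoring, so documents matching both tags should no longer be outranked by a document that matches only one tag on a small shard.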