I have these documents:
{
"created_at" : "2017-07-31T20:30:14-04:00",
"description" : null,
"height" : 3213,
"id" : "1",
"tags" : [
{
"confidence" : 65.48948436785749,
"tag" : "beach"
},
{
"confidence" : 57.31950504425406,
"tag" : "sea"
},
{
"confidence" : 43.58207236617374,
"tag" : "coast"
},
{
"confidence" : 35.6857910950816,
"tag" : "sand"
},
{
"confidence" : 33.660057321079655,
"tag" : "landscape"
},
{
"confidence" : 32.53252312423727,
"tag" : "sky"
}
],
"width" : 5712,
"color" : "#0C0A07",
"boost_multiplier" : 1
}
and
{
"created_at" : "2017-07-31T20:43:17-04:00",
"description" : null,
"height" : 4934,
"id" : "2",
"tags" : [
{
"confidence" : 84.09123410403951,
"tag" : "mountain"
},
{
"confidence" : 56.412795342449456,
"tag" : "valley"
},
{
"confidence" : 48.36547551196872,
"tag" : "landscape"
},
{
"confidence" : 40.51100450186575,
"tag" : "mountains"
},
{
"confidence" : 33.14263528292239,
"tag" : "sky"
},
{
"confidence" : 31.064394646169404,
"tag" : "peak"
},
{
"confidence" : 29.372,
"tag" : "natural elevation"
}
],
"width" : 4016,
"color" : "#FEEBF9",
"boost_multiplier" : 1
}
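I would like to calculate _score based on the confidence value of each matching tag. For example, if you search for "mountain", it should return only the doc with id 2. If you search for "landscape", the score of doc 2 should be higher than that of doc 1, because the confidence of landscape in 2 is higher than in 1 (48.36 vs 33.66). If you search for "coast landscape", this time the score of doc 1 should be higher than that of doc 2, because doc 1 has both coast and landscape in its tags array. I also want to multiply the score by "boost_multiplier" so that some documents can be boosted over others.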
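The intended ranking can be sketched in plain Python, outside Elasticsearch. This is only an illustration under stated assumptions: `desired_score` is a hypothetical helper, terms are matched by exact, whitespace-split comparison (so a multi-word tag such as "natural elevation" is not handled), and the two documents carry only a subset of the tags listed above.

```python
# Subset of the tags from the two documents above (values copied verbatim).
DOC_1 = {
    "id": "1",
    "boost_multiplier": 1,
    "tags": [
        {"tag": "coast", "confidence": 43.58207236617374},
        {"tag": "landscape", "confidence": 33.660057321079655},
    ],
}
DOC_2 = {
    "id": "2",
    "boost_multiplier": 1,
    "tags": [
        {"tag": "mountain", "confidence": 84.09123410403951},
        {"tag": "landscape", "confidence": 48.36547551196872},
    ],
}


def desired_score(doc, search_terms):
    """Sum the confidence of every matching tag, then apply the boost.

    Returns None when no tag matches, meaning the doc should not be a hit.
    Assumption: exact matching on whitespace-split terms, no analyzer.
    """
    terms = set(search_terms.lower().split())
    matched = [t["confidence"] for t in doc["tags"] if t["tag"] in terms]
    if not matched:
        return None
    return sum(matched) * doc.get("boost_multiplier", 1)


if __name__ == "__main__":
    for terms in ("mountain", "landscape", "coast landscape"):
        print(terms, desired_score(DOC_1, terms), desired_score(DOC_2, terms))
```

The real _score in Elasticsearch will differ from these numbers, because the relevance score of the match query is also part of the result; the sketch only shows the ordering being asked for.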
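I found this question on SO: Elasticsearch: Influence scoring with custom score field in document
but when I try the accepted solution (I enabled scripting on my ES server), it returns both documents with _score 1.0, regardless of the search term. Here is the query I tried: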
{
  "query": {
    "nested": {
      "path": "tags",
      "score_mode": "sum",
      "query": {
        "function_score": {
          "query": {
            "match": {
              "tags.tag": "coast landscape"
            }
          },
          "script_score": {
            "script": "doc[\"confidence\"].value"
          }
        }
      }
    }
  }
}
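I also tried what @yahermann suggested in the comments, replacing "script_score" with "field_value_factor": {"field": "confidence"}, with the same result. Any idea why it fails, or is there a better way to do this?
For completeness, here is the mapping definition I used: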
{
  "mappings": {
    "photo": {
      "properties": {
        "created_at": {
          "type": "date"
        },
        "description": {
          "type": "text"
        },
        "height": {
          "type": "short"
        },
        "id": {
          "type": "keyword"
        },
        "tags": {
          "type": "nested",
          "properties": {
            "tag": { "type": "string" },
            "confidence": { "type": "float"}
          }
        },
        "width": {
          "type": "short"
        },
        "color": {
          "type": "string"
        },
        "boost_multiplier": {
          "type": "float"
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1
  }
}
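UPDATE Following @Joanna's answer below, I tried the query, but whatever I put in the match query (coast, foo, bar), it always returns both documents with _score 1.0. I tried it on Elasticsearch 2.4.6, 5.3 and 5.5.1 in Docker. Here is the response I get: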
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 1635
{"took":24,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"my_index","_type":"my_type","_id":"2","_score":1.0,"_source":{
"created_at" : "2017-07-31T20:43:17-04:00",
"description" : null,
"height" : 4934,
"id" : "2",
"tags" : [
{
"confidence" : 84.09123410403951,
"tag" : "mountain"
},
{
"confidence" : 56.412795342449456,
"tag" : "valley"
},
{
"confidence" : 48.36547551196872,
"tag" : "landscape"
},
{
"confidence" : 40.51100450186575,
"tag" : "mountains"
},
{
"confidence" : 33.14263528292239,
"tag" : "sky"
},
{
"confidence" : 31.064394646169404,
"tag" : "peak"
},
{
"confidence" : 29.372,
"tag" : "natural elevation"
}
],
"width" : 4016,
"color" : "#FEEBF9",
"boost_multiplier" : 1
}
},{"_index":"my_index","_type":"my_type","_id":"1","_score":1.0,"_source":{
"created_at" : "2017-07-31T20:30:14-04:00",
"description" : null,
"height" : 3213,
"id" : "1",
"tags" : [
{
"confidence" : 65.48948436785749,
"tag" : "beach"
},
{
"confidence" : 57.31950504425406,
"tag" : "sea"
},
{
"confidence" : 43.58207236617374,
"tag" : "coast"
},
{
"confidence" : 35.6857910950816,
"tag" : "sand"
},
{
"confidence" : 33.660057321079655,
"tag" : "landscape"
},
{
"confidence" : 32.53252312423727,
"tag" : "sky"
}
],
"width" : 5712,
"color" : "#0C0A07",
"boost_multiplier" : 1
}
}]}}
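UPDATE-2 I found this on SO: Elasticsearch: "function_score" with "boost_mode":"replace" ignores function score
It basically says that if the function does not match, it returns 1. That makes sense, but I am running the query against the same documents. This is confusing.
FINAL UPDATE Finally I found the problem, silly me. ES101: if you send a GET request to the search API, it returns all documents with a score of 1.0 :) You should send a POST request... Thanks a lot @Joanna, it works perfectly!!!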
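A likely explanation for the symptom is that the request body never reached the server: a search without a body is treated as a match_all query, which gives every hit a score of 1.0, and some HTTP clients do not send a body with GET. As a minimal sketch of sending the query explicitly with POST, using only the Python standard library: the base URL `http://localhost:9200` is an assumption (the index name `my_index` comes from the response above), and `build_search_request` is a hypothetical helper, not part of any Elasticsearch client.

```python
import json
import urllib.request


def build_search_request(base_url, index, query):
    """Build a POST request for the _search endpoint with a JSON body."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # explicit, so the body is always sent
    )


if __name__ == "__main__":
    query = {"query": {"match_all": {}}}
    # Assumed local node; adjust to your own host and port.
    request = build_search_request("http://localhost:9200", "my_index", query)
    with urllib.request.urlopen(request) as response:
        print(json.loads(response.read())["hits"]["max_score"])
```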
Answer 0 (score: 2)
You can try this query. It combines scoring by the confidence and boost_multiplier fields:
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "should": [{
            "nested": {
              "path": "tags",
              "score_mode": "sum",
              "query": {
                "function_score": {
                  "query": {
                    "match": {
                      "tags.tag": "landscape"
                    }
                  },
                  "field_value_factor": {
                    "field": "tags.confidence",
                    "factor": 1,
                    "missing": 0
                  }
                }
              }
            }
          }]
        }
      },
      "field_value_factor": {
        "field": "boost_multiplier",
        "factor": 1,
        "missing": 0
      }
    }
  }
}
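If the query above has to be issued for different search terms, it can be generated rather than edited by hand. The following is a small sketch that rebuilds the same structure as a Python dict; `build_tag_query` and its parameters are hypothetical names introduced here, and the field names are the ones from the mapping in the question.

```python
import json


def build_tag_query(search_terms, factor=1, missing=0):
    """Return the nested function_score query for the given search terms."""
    # Inner function_score: runs per nested tag object and scales the
    # match score by that tag's confidence.
    per_tag = {
        "function_score": {
            "query": {"match": {"tags.tag": search_terms}},
            "field_value_factor": {
                "field": "tags.confidence",
                "factor": factor,
                "missing": missing,
            },
        }
    }
    # Outer function_score: scales the summed tag scores by the
    # document-level boost_multiplier.
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "should": [
                            {
                                "nested": {
                                    "path": "tags",
                                    "score_mode": "sum",
                                    "query": per_tag,
                                }
                            }
                        ]
                    }
                },
                "field_value_factor": {
                    "field": "boost_multiplier",
                    "factor": factor,
                    "missing": missing,
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_tag_query("coast landscape"), indent=2))
```

The sketch uses one `factor` for both levels to stay short; in practice the two factors can be tuned separately.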
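When I search with the term coast, it returns:
the document with id=1, because it is the only one that contains this term, with "_score": 100.27469.
When I search with the term landscape, it returns two documents:
the document with id=2, with "_score": 85.83046
the document with id=1, with "_score": 59.7339
Because the document with id=2 has the higher confidence value for this tag, it gets the higher score.
When I search with the terms coast landscape, it returns two documents:
the document with id=1, with "_score": 160.00859
the document with id=2, with "_score": 85.83046
Although the document with id=2 has the higher confidence value for landscape, the document with id=1 matches both terms, so it gets the higher score. By changing the value of the "factor": 1 parameter, you can decide how much confidence should influence the result.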
Something more interesting happens when I index a new document. Let's say it is almost the same as the document with id=2, but I set "boost_multiplier": 4 and "id": 3:
{
  "created_at" : "2017-07-31T20:43:17-04:00",
  "description" : null,
  "height" : 4934,
  "id" : "3",
  "tags" : [
    ...
    {
      "confidence" : 48.36547551196872,
      "tag" : "landscape"
    },
    ...
  ],
  "width" : 4016,
  "color" : "#FEEBF9",
  "boost_multiplier" : 4
}
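Running the same query with the terms coast landscape returns three documents:
the document with id=3, with "_score": 360.02664
the document with id=1, with "_score": 182.09859
the document with id=2, with "_score": 90.00666
Although the document with id=3 has only one matching term (landscape), its boost_multiplier value greatly increases its score. Here too, "factor": 1 lets you decide how much this value should increase the score, and "missing": 0 decides what happens when no such field is indexed for a document.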
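The effect of "factor" and "missing" can be illustrated with a short Python sketch of how field_value_factor is documented to behave: the field value is multiplied by the factor, an optional modifier is applied, and the result is combined with the query score (multiplied, under the default boost_mode). `field_value_factor` below is a hypothetical re-implementation for illustration only, and it covers just three of the available modifiers.

```python
import math

# Subset of the modifiers documented for field_value_factor.
MODIFIERS = {
    "none": lambda x: x,
    "log1p": lambda x: math.log10(x + 1),
    "sqrt": math.sqrt,
}


def field_value_factor(doc, field, factor=1.0, modifier="none", missing=None):
    """Compute modifier(factor * value), falling back to `missing`."""
    value = doc.get(field)
    if value is None:
        if missing is None:
            # Assumption: mirrors the error raised when the field is
            # absent and no "missing" value is configured.
            raise ValueError(f"missing value for field [{field}]")
        value = missing
    return MODIFIERS[modifier](factor * value)


def boosted_score(query_score, doc, **kwargs):
    """Combine with the query score as boost_mode "multiply" would."""
    return query_score * field_value_factor(doc, "boost_multiplier", **kwargs)


if __name__ == "__main__":
    # Document 3 has the same tags as document 2 but boost_multiplier 4,
    # so its score is four times the 90.00666 reported for document 2.
    print(boosted_score(90.00666, {"boost_multiplier": 4}))
    # With "missing": 0, a document lacking the field scores 0.
    print(boosted_score(90.00666, {}, missing=0))
```

This matches the reported results: 360.02664 for id=3 is four times 90.00666 for id=2. It also shows a consequence of "missing": 0 with multiplication, which is that documents without boost_multiplier end up with a score of 0; a value of 1 would leave their score unchanged.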