I have a set of words extracted from text by an NLP algorithm, along with a relevance score for each word in every document.
For example:
document 1: { "vocab": [ {"wtag":"James Bond", "rscore": 2.14 },
{"wtag":"world", "rscore": 0.86 },
....,
{"wtag":"somemore", "rscore": 3.15 }
]
}
document 2: { "vocab": [ {"wtag":"hiii", "rscore": 1.34 },
{"wtag":"world", "rscore": 0.94 },
....,
{"wtag":"somemore", "rscore": 3.23 }
]
}
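I want the rscore of each matched wtag to affect the _score that ES gives to the whole document, perhaps by being multiplied with or added to the _score, so that it influences the final _score (and, in turn, the order) of the resulting documents. Is there any way to achieve this?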
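Answer 0 (score: 17)
Another way to solve this is to use nested documents:
First, set up the mapping so that vocab is a nested document, which means that each wtag/rscore document is indexed internally as a separate document: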
curl -XPUT "http://localhost:9200/myindex/" -d'
{
"settings": {"number_of_shards": 1},
"mappings": {
"mytype": {
"properties": {
"vocab": {
"type": "nested",
"properties": {
"wtag": {
"type": "string"
},
"rscore": {
"type": "float"
}
}
}
}
}
}
}'
Then index your documents:
curl -XPUT "http://localhost:9200/myindex/mytype/1" -d'
{
"vocab": [
{
"wtag": "James Bond",
"rscore": 2.14
},
{
"wtag": "world",
"rscore": 0.86
},
{
"wtag": "somemore",
"rscore": 3.15
}
]
}'
curl -XPUT "http://localhost:9200/myindex/mytype/2" -d'
{
"vocab": [
{
"wtag": "hiii",
"rscore": 1.34
},
{
"wtag": "world",
"rscore": 0.94
},
{
"wtag": "somemore",
"rscore": 3.23
}
]
}'
Run a nested query to match all nested documents, and add up the values of rscore for each nested document that matches:
curl -XGET "http://localhost:9200/myindex/mytype/_search" -d'
{
"query": {
"nested": {
"path": "vocab",
"score_mode": "sum",
"query": {
"function_score": {
"query": {
"match": {
"vocab.wtag": "james bond world"
}
},
"script_score": {
"script": "doc[\"vocab.rscore\"].value"
}
}
}
}
}
}'
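To make the scoring of the nested query easier to follow, here is a minimal Python sketch of the arithmetic, not of Elasticsearch itself. It assumes that a match query matches a nested entry when any query term appears in its wtag, that function_score uses its default boost_mode of multiply, and that every text match scores a hypothetical constant 1.0 (real match scores will differ). The names DOCS and nested_sum_score are made up for this illustration.

```python
import math

# Nested entries copied from the two example documents.
DOCS = {
    "1": [("James Bond", 2.14), ("world", 0.86), ("somemore", 3.15)],
    "2": [("hiii", 1.34), ("world", 0.94), ("somemore", 3.23)],
}


def nested_sum_score(vocab, query, text_score=1.0):
    """Model score_mode "sum" over the scores of the matching nested entries.

    text_score is a hypothetical constant standing in for the real match score.
    """
    query_terms = set(query.lower().split())
    total = 0.0
    for wtag, rscore in vocab:
        # The entry matches when it shares at least one term with the query.
        if query_terms & set(wtag.lower().split()):
            # Default boost_mode: query score multiplied by the script result.
            total += text_score * rscore
    return total


scores = {doc_id: nested_sum_score(vocab, "james bond world")
          for doc_id, vocab in DOCS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Under these assumptions document 1 collects the rscore of both "James Bond" and "world", while document 2 only collects "world", so document 1 ranks first.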
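Answer 1 (score: 8)
Have a look at the delimited payload token filter, which you can use to store the scores as payloads, and at text scoring in scripts, which gives you access to the payloads.
Updated to include an example
First you need to set up an analyzer that takes the number after | and stores that value as a payload on each token: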
curl -XPUT "http://localhost:9200/myindex/" -d'
{
"settings": {
"analysis": {
"analyzer": {
"payloads": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"delimited_payload_filter"
]
}
}
}
},
"mappings": {
"mytype": {
"properties": {
"text": {
"type": "string",
"analyzer": "payloads",
"term_vector": "with_positions_offsets_payloads"
}
}
}
}
}'
Then index your document:
curl -XPUT "http://localhost:9200/myindex/mytype/1" -d'
{
"text": "James|2.14 Bond|2.14 world|0.86 somemore|3.15"
}'
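As a rough illustration of what the analyzer does to this text, the following Python sketch mimics the three steps (whitespace tokenizer, lowercase filter, delimited payload filter with the default | delimiter and float encoding). It is only a model of the behaviour, and analyze_with_payloads is a hypothetical helper name, not an Elasticsearch API.

```python
def analyze_with_payloads(text, delimiter="|"):
    """Return (term, payload) pairs for a payload-annotated string."""
    tokens = []
    for raw in text.split():      # whitespace tokenizer
        raw = raw.lower()         # lowercase filter
        # Delimited payload filter: text after the delimiter becomes the payload.
        term, _, payload = raw.partition(delimiter)
        tokens.append((term, float(payload) if payload else None))
    return tokens


tokens = analyze_with_payloads("James|2.14 Bond|2.14 world|0.86 somemore|3.15")
for term, payload in tokens:
    print(term, payload)
```

Note that "James Bond" becomes two separate tokens, which is why the example document repeats the score 2.14 on both of them.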
Finally, search with a function_score query that iterates over each term, retrieves the payload and incorporates it into the _score:
curl -XGET "http://localhost:9200/myindex/mytype/_search" -d'
{
"query": {
"function_score": {
"query": {
"match": {
"text": "james bond"
}
},
"script_score": {
"script": "score=0; for (term: my_terms) { termInfo = _index[\"text\"].get(term,_PAYLOADS ); for (pos : termInfo) { score = score + pos.payloadAsFloat(0);} } return score;",
"params": {
"my_terms": [
"james",
"bond"
]
}
}
}
}
}'
When the script itself is not compressed onto one line, it looks like this:
score=0;
for (term: my_terms) {
termInfo = _index['text'].get(term,_PAYLOADS );
for (pos : termInfo) {
score = score + pos.payloadAsFloat(0);
}
}
return score;
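The loop above can be mirrored in plain Python to check the number it would produce for document 1. This is a sketch under assumptions: INDEXED_TEXT is a hypothetical stand-in for the term positions and payloads that the script reads through _index, and payload_score is a made-up name. Since function_score multiplies the query _score by the script result by default, the value computed here would be one factor of the final score rather than the final score itself.

```python
import math

# Hypothetical stand-in for the indexed "text" field of document 1:
# each term maps to its payloads, one per position where the term occurs.
INDEXED_TEXT = {
    "james": [2.14],
    "bond": [2.14],
    "world": [0.86],
    "somemore": [3.15],
}


def payload_score(my_terms, indexed_text):
    """Sum the payload of every position of every requested term."""
    score = 0.0
    for term in my_terms:
        # A term that is absent from the document contributes nothing.
        for payload in indexed_text.get(term, []):
            score += payload
    return score


script_result = payload_score(["james", "bond"], INDEXED_TEXT)
print(script_result)
```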
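Warning: accessing payloads carries a significant performance cost, and running scripts adds a further cost. You may want to experiment with a dynamic script as shown above and then, once you are satisfied with the results, rewrite the script as a native Java script.
Answer 2 (score: 2)
I think the script_score function is what you need (doc).
The function score query was introduced in 0.90.4; if you are using an older version, look at custom score queries.
Answer 3 (score: 1)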