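I have a product index with many fields, and each field is analyzed with morphology and synonym filters.
A simplified version of the index with 2 fields is here: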
https://gist.github.com/anonymous/6e287d328a72df07bc491312820ffdef
First query:
GET /products/nms/_search
{
"size": 40,
"_source": {
"include": [
"_id"
]
},
"query": {
"multi_match": {
"fields": [
"subject.value^2",
"colors"
],
"minimum_should_match": "30%",
"operator": "and",
"query": "футболка белая",
"type": "cross_fields"
}
}
}
Result:
"hits": {
"total": 6615,
"max_score": 9.118673,
These results are correct.
But when I swap the words, the second query:
GET /products/nms/_search
{
"size": 40,
"_source": {
"include": [
"_id"
]
},
"query": {
"multi_match": {
"fields": [
"subject.value^2",
"colors"
],
"minimum_should_match": "30%",
"operator": "and",
"query": "белая футболка",
"type": "cross_fields"
}
}
}
I get:
"hits": {
"total": 145434,
"max_score": 10.683464,
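The results look nothing like the first ones: not a single document from the first result set appears in the top 100 matches.
I spent some time digging into this but still cannot find a solution. Because of the document structure (more than 15 fields) I am forced to use cross_fields. As far as I understand, in this case Elasticsearch counts every synonym hit on any field, so a document can collect around 10 hits for "белая" (white) and match without containing "футболка" (t-shirt) at all.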
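One way to check this assumption is to rerun the swapped query with `"explain": true`, which makes Elasticsearch attach an `_explanation` tree to every hit. The body below is only a sketch for `GET /products_color_test/nms/_search`; it reuses the fields and query text from this post, and the reduced `size` is an arbitrary choice to keep the output readable.

```json
{
  "explain": true,
  "size": 5,
  "query": {
    "multi_match": {
      "fields": [
        "subject.value^2",
        "colors"
      ],
      "minimum_should_match": "30%",
      "operator": "and",
      "query": "белая футболка",
      "type": "cross_fields"
    }
  }
}
```

In the `_explanation` of an unexpected hit, each `weight(field:term ...)` entry shows which analyzed term matched in which field, so it should be visible whether the match came only from synonym tokens in `colors`.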
For example, take 4 documents:
PUT products_color_test/nms/1
{
"colors": "белая", //white
"subject" : {
"id" :1,
"value": "футболка"} //t-shirt
}
PUT products_color_test/nms/2
{
"colors": "черная", //black
"subject" : {
"id" :1,
"value": "футболка"} //t-shirt
}
PUT products_color_test/nms/3
{
"colors": "молочная", //synonym to white
"subject" : {
"id" :1,
"value": "футболка"} //t-shirt
}
PUT products_color_test/nms/4
{
"colors": "молочная", //synonym to white
"subject" : {
"id" :2,
"value": "куртка"} //jacket
}
Let's test it:
GET /products_color_test/nms/_search
{
"size": 40,
"query": {
"multi_match": {
"fields": [
"subject.value^2",
"colors"
],
"minimum_should_match": "30%",
"operator": "and",
"query": "футболка белая",
"type": "cross_fields"
}
}
}
The result is:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.58422226,
"hits": [
{
"_index": "products_color_test",
"_type": "nms",
"_id": "3",
"_score": 0.58422226,
"_source": {
"colors": "молочная",
"subject": {
"id": 1,
"value": "футболка"
}
}
},
{
"_index": "products_color_test",
"_type": "nms",
"_id": "1",
"_score": 0.568724,
"_source": {
"colors": "белая",
"subject": {
"id": 1,
"value": "футболка"
}
}
}
]
}
}
Almost correct, except that the synonym hit scores higher than the exact hit.
But after swapping the words:
GET /products_color_test/nms/_search
{
"size": 40,
"query": {
"multi_match": {
"fields": [
"subject.value^2",
"colors"
],
"minimum_should_match": "30%",
"operator": "and",
"query": "белая футболка",
"type": "cross_fields"
}
}
}
"hits": {
"total": 3,
"max_score": 0.58422226,
"hits": [
{
"_index": "products_color_test",
"_type": "nms",
"_id": "3",
"_score": 0.58422226,
"_source": {
"colors": "молочная",
"subject": {
"id": 1,
"value": "футболка"
}
}
},
{
"_index": "products_color_test",
"_type": "nms",
"_id": "1",
"_score": 0.568724,
"_source": {
"colors": "белая",
"subject": {
"id": 1,
"value": "футболка"
}
}
},
{
"_index": "products_color_test",
"_type": "nms",
"_id": "4",
"_score": 0.46449086,
"_source": {
"colors": "молочная",
"subject": {
"id": 2,
"value": "куртка" // jacket ----!!!!!----
}
}
}
]
}
}
Question:
Thanks!
P.S. Sorry for my English.
Answer 0 (score: 0)
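It seems that adding
"expand": false
to the synonym filter solved the puzzle. As far as I understand, with this setting ES no longer expands a term into the whole synonym group; instead every synonym in a group is mapped to the first term of that group, so a single token is indexed and searched for.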
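For illustration, here is a minimal sketch of index settings with such a filter, as the body of `PUT /products_color_test`. The names `color_synonyms` and `color_analyzer` and the synonym list are assumptions made for this example, not the settings from the gist, and the morphology filter is left out.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "color_synonyms": {
          "type": "synonym",
          "expand": false,
          "synonyms": [
            "белая, молочная"
          ]
        }
      },
      "analyzer": {
        "color_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "color_synonyms"
          ]
        }
      }
    }
  }
}
```

With `"expand": false`, the rule `белая, молочная` behaves like the explicit mapping `белая, молочная => белая`, so both words produce the same single token. An index has to be recreated or reindexed for a changed index-time analyzer to take effect.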
Now the two swapped queries return similar results, and ES counts the synonym hit only once:
"_explanation": {
"value": 0.5622277,
"description": "sum of:",
"details": [
{
"value": 0.5622277,
"description": "sum of:",
"details": [
{
"value": 0.37481847,
"description": "max of:",
"details": [
{
"value": 0.37481847,
"description": "weight(subject.value:футболка in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.37481847,
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"value": 0.37481847,
"description": "queryWeight, product of:",
"details": [
{
"value": 2,
"description": "boost",
"details": []
},
{
"value": 1,
"description": "idf(docFreq=3, maxDocs=4)",
"details": []
},
{
"value": 0.18740924,
"description": "queryNorm",
"details": []
}
]
},
{
"value": 1,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
}
]
},
{
"value": 1,
"description": "idf(docFreq=3, maxDocs=4)",
"details": []
},
{
"value": 1,
"description": "fieldNorm(doc=0)",
"details": []
}
]
}
]
}
]
}
]
},
{
"value": 0.18740924,
"description": "max of:",
"details": [
{
"value": 0.18740924,
"description": "weight(colors:белый in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.18740924,
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"value": 0.18740924,
"description": "queryWeight, product of:",
"details": [
{
"value": 1,
"description": "idf(docFreq=3, maxDocs=4)",
"details": []
},
{
"value": 0.18740924,
"description": "queryNorm",
"details": []
}
]
},
{
"value": 1,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
}
]
},
{
"value": 1,
"description": "idf(docFreq=3, maxDocs=4)",
"details": []
},
{
"value": 1,
"description": "fieldNorm(doc=0)",
"details": []
}
]
}
]
}
]
}
]
}
]
},
{
"value": 0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0,
"description": "# clause",
"details": []
},
{
"value": 0.18740924,
"description": "_type:nms, product of:",
"details": [
{
"value": 1,
"description": "boost",
"details": []
},
{
"value": 0.18740924,
"description": "queryNorm",
"details": []
}
]
}
]
}
]
}
},
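The explanation above shows a single term, `colors:белый`, for the color clause. To confirm that a synonym really collapses to one token, the `_analyze` API can be asked how the `colors` field analyzes a given word. This is a sketch of a body for `GET /products_color_test/_analyze`; it assumes a version that accepts a JSON body here (older releases may expect these as query-string parameters).

```json
{
  "field": "colors",
  "text": "молочная"
}
```

If the filter works as described, the response should contain one token for this word rather than one token per synonym.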