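Elasticsearch multi_match: results for "A and B" differ from "B and A"

Date: 2016-07-18 08:02:42

Tags: elasticsearch

I have a products index with many fields, and each of them is analyzed with morphology and synonym filters.

An index simplified to 2 fields is here: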

https://gist.github.com/anonymous/6e287d328a72df07bc491312820ffdef

First query:

GET /products/nms/_search
{
   "size": 40,
   "_source": {
      "include": [
         "_id"
      ]
   },
   "query": {
      "multi_match": {
         "fields": [
            "subject.value^2",
            "colors"
         ],
         "minimum_should_match": "30%",
         "operator": "and",
         "query": "футболка белая",
         "type": "cross_fields"
      }
   }
}

Result:

   "hits": {
      "total": 6615,
      "max_score": 9.118673,

These results are correct.

But when I swap the words, the second query:

GET /products/nms/_search
{
   "size": 40,
   "_source": {
      "include": [
         "_id"
      ]
   },
   "query": {
      "multi_match": {
         "fields": [
            "subject.value^2",
            "colors"
         ],
         "minimum_should_match": "30%",
         "operator": "and",
         "query": "белая футболка",
         "type": "cross_fields"
      }
   }
}

I get:

   "hits": {
      "total": 145434,
      "max_score": 10.683464,

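And nothing resembles the first result: not a single hit from it appears in the first 100 matches.

I spent some time digging into this but still cannot find a solution. Because of the document structure (more than 15 fields) I am forced to use cross_fields, and as far as I understand, in this case Elasticsearch counts every synonym hit on any field, so there end up being 10 hits for "белая" (white) with no "футболка" (T-shirt).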
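One way to check this guess is the validate API with the explain flag, which reports the Lucene query that the multi_match is rewritten into without running the search. The request below is only a sketch: it reuses the first query unchanged, and the returned "explanations" entry is where the expanded synonym terms and per-field clauses should be visible. Running it once per word order and comparing the two rewritten queries would show whether the order changes how the terms are grouped.

```
# Sketch: ask how the query is rewritten instead of executing it.
# Swap the two words in "query" and compare the two explanations.
GET /products/nms/_validate/query?explain
{
   "query": {
      "multi_match": {
         "fields": [
            "subject.value^2",
            "colors"
         ],
         "minimum_should_match": "30%",
         "operator": "and",
         "query": "футболка белая",
         "type": "cross_fields"
      }
   }
}
```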

For example, we have 4 documents:

PUT products_color_test/nms/1
{
    "colors": "белая", // white
    "subject": {
        "id": 1,
        "value": "футболка" // t-shirt
    }
}
PUT products_color_test/nms/2
{
    "colors": "черная", // black
    "subject": {
        "id": 1,
        "value": "футболка" // t-shirt
    }
}
PUT products_color_test/nms/3
{
    "colors": "молочная", // synonym to white
    "subject": {
        "id": 1,
        "value": "футболка" // t-shirt
    }
}
PUT products_color_test/nms/4
{
    "colors": "молочная", // synonym to white
    "subject": {
        "id": 2,
        "value": "куртка" // jacket
    }
}
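Before querying, it can help to see which terms the analysis chain produces for a color value. The sketch below uses the analyze API in its request-body form, which assumes a 2.x-era cluster and assumes that the test index was created with the same analysis settings as the gist. If the synonym filter expands synonyms, the response should list several tokens at the same position for a single input word.

```
# Sketch, assuming a 2.x-style cluster and the gist's analysis settings:
# list the terms produced for the "colors" field.
GET /products_color_test/_analyze
{
   "field": "colors",
   "text": "молочная"
}
```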

Let's test it.

GET /products_color_test/nms/_search
{
   "size": 40,
   "query": {
      "multi_match": {
         "fields": [
            "subject.value^2",
            "colors"
         ],
         "minimum_should_match": "30%",
         "operator": "and",
         "query": "футболка белая",
         "type": "cross_fields"
      }
   }
}

The result is:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.58422226,
      "hits": [
         {
            "_index": "products_color_test",
            "_type": "nms",
            "_id": "3",
            "_score": 0.58422226,
            "_source": {
               "colors": "молочная",
               "subject": {
                  "id": 1,
                  "value": "футболка"
               }
            }
         },
         {
            "_index": "products_color_test",
            "_type": "nms",
            "_id": "1",
            "_score": 0.568724,
            "_source": {
               "colors": "белая",
               "subject": {
                  "id": 1,
                  "value": "футболка"
               }
            }
         }
      ]
   }
}

Almost correct, except that the synonym hit gets a higher score than the exact hit.

But after swapping:

GET /products_color_test/nms/_search
{
   "size": 40,
   "query": {
      "multi_match": {
         "fields": [
            "subject.value^2",
            "colors"
         ],
         "minimum_should_match": "30%",
         "operator": "and",
         "query": "белая футболка",
         "type": "cross_fields"
      }
   }
}


   "hits": {
  "total": 3,
  "max_score": 0.58422226,
  "hits": [
     {
        "_index": "products_color_test",
        "_type": "nms",
        "_id": "3",
        "_score": 0.58422226,
        "_source": {
           "colors": "молочная",
           "subject": {
              "id": 1,
              "value": "футболка"
           }
        }
     },
     {
        "_index": "products_color_test",
        "_type": "nms",
        "_id": "1",
        "_score": 0.568724,
        "_source": {
           "colors": "белая",
           "subject": {
              "id": 1,
              "value": "футболка"
           }
        }
     },
     {
        "_index": "products_color_test",
        "_type": "nms",
        "_id": "4",
        "_score": 0.46449086,
        "_source": {
           "colors": "молочная",
           "subject": {
              "id": 2,
              "value": "куртка" // jacket ----!!!!!----
           }
        }
     }
  ]
  }
}
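To see where the jacket's score comes from, the same request can be re-run with the explain flag in the search body. This is a sketch that changes nothing else in the query; each hit then carries an "_explanation" tree showing which field and which term contributed to its score.

```
# Sketch: the swapped query from above with per-hit score explanations.
GET /products_color_test/nms/_search
{
   "size": 40,
   "explain": true,
   "query": {
      "multi_match": {
         "fields": [
            "subject.value^2",
            "colors"
         ],
         "minimum_should_match": "30%",
         "operator": "and",
         "query": "белая футболка",
         "type": "cross_fields"
      }
   }
}
```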

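Questions:

  • OK, synonyms are counted as many times as there are of them. But why does the score differ depending on where in the query the word with synonyms is placed?
  • Is there a way to make ES count a synonym only once, while keeping the document structure and the multi_match query with cross_fields?

Thanks!

P.S. Sorry for my English.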

1 answer:

Answer 0 (score: 0)

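It seems that adding

"expand": false

to the synonym filter solves this puzzle. As far as I understand, this makes ES take only the first synonym at index time but use the whole expanded set at search time.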
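For reference, here is a minimal sketch of where that flag goes. The index name, the filter name color_synonyms, the analyzer name color_analyzer and the synonym rule itself are all hypothetical, and the morphology filter from the real index is left out. Note that in the documented synonym rule format, expand set to false reduces every word in a comma-separated rule to the first entry wherever the filter runs, so with one analyzer for both indexing and searching, both sides end up with a single term.

```
# Sketch only: the index, filter and analyzer names and the synonym rule
# are made up for illustration; the real index also applies morphology.
PUT /products_color_test_v2
{
   "settings": {
      "analysis": {
         "filter": {
            "color_synonyms": {
               "type": "synonym",
               "expand": false,
               "synonyms": [
                  "белая, молочная"
               ]
            }
         },
         "analyzer": {
            "color_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "lowercase",
                  "color_synonyms"
               ]
            }
         }
      }
   },
   "mappings": {
      "nms": {
         "properties": {
            "colors": {
               "type": "string",
               "analyzer": "color_analyzer"
            }
         }
      }
   }
}
```

With this rule, both "белая" and "молочная" would be indexed and searched as the single term "белая", so a document can match the color only once.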

Now the results of the two swapped queries are similar, and ES counts a synonym hit only once:

        "_explanation": {
           "value": 0.5622277,
           "description": "sum of:",
           "details": [
              {
                 "value": 0.5622277,
                 "description": "sum of:",
                 "details": [
                    {
                       "value": 0.37481847,
                       "description": "max of:",
                       "details": [
                          {
                             "value": 0.37481847,
                             "description": "weight(subject.value:футболка in 0) [PerFieldSimilarity], result of:",
                             "details": [
                                {
                                   "value": 0.37481847,
                                   "description": "score(doc=0,freq=1.0), product of:",
                                   "details": [
                                      {
                                         "value": 0.37481847,
                                         "description": "queryWeight, product of:",
                                         "details": [
                                            {
                                               "value": 2,
                                               "description": "boost",
                                               "details": []
                                            },
                                            {
                                               "value": 1,
                                               "description": "idf(docFreq=3, maxDocs=4)",
                                               "details": []
                                            },
                                            {
                                               "value": 0.18740924,
                                               "description": "queryNorm",
                                               "details": []
                                            }
                                         ]
                                      },
                                      {
                                         "value": 1,
                                         "description": "fieldWeight in 0, product of:",
                                         "details": [
                                            {
                                               "value": 1,
                                               "description": "tf(freq=1.0), with freq of:",
                                               "details": [
                                                  {
                                                     "value": 1,
                                                     "description": "termFreq=1.0",
                                                     "details": []
                                                  }
                                               ]
                                            },
                                            {
                                               "value": 1,
                                               "description": "idf(docFreq=3, maxDocs=4)",
                                               "details": []
                                            },
                                            {
                                               "value": 1,
                                               "description": "fieldNorm(doc=0)",
                                               "details": []
                                            }
                                         ]
                                      }
                                   ]
                                }
                             ]
                          }
                       ]
                    },
                    {
                       "value": 0.18740924,
                       "description": "max of:",
                       "details": [
                          {
                             "value": 0.18740924,
                             "description": "weight(colors:белый in 0) [PerFieldSimilarity], result of:",
                             "details": [
                                {
                                   "value": 0.18740924,
                                   "description": "score(doc=0,freq=1.0), product of:",
                                   "details": [
                                      {
                                         "value": 0.18740924,
                                         "description": "queryWeight, product of:",
                                         "details": [
                                            {
                                               "value": 1,
                                               "description": "idf(docFreq=3, maxDocs=4)",
                                               "details": []
                                            },
                                            {
                                               "value": 0.18740924,
                                               "description": "queryNorm",
                                               "details": []
                                            }
                                         ]
                                      },
                                      {
                                         "value": 1,
                                         "description": "fieldWeight in 0, product of:",
                                         "details": [
                                            {
                                               "value": 1,
                                               "description": "tf(freq=1.0), with freq of:",
                                               "details": [
                                                  {
                                                     "value": 1,
                                                     "description": "termFreq=1.0",
                                                     "details": []
                                                  }
                                               ]
                                            },
                                            {
                                               "value": 1,
                                               "description": "idf(docFreq=3, maxDocs=4)",
                                               "details": []
                                            },
                                            {
                                               "value": 1,
                                               "description": "fieldNorm(doc=0)",
                                               "details": []
                                            }
                                         ]
                                      }
                                   ]
                                }
                             ]
                          }
                       ]
                    }
                 ]
              },
              {
                 "value": 0,
                 "description": "match on required clause, product of:",
                 "details": [
                    {
                       "value": 0,
                       "description": "# clause",
                       "details": []
                    },
                    {
                       "value": 0.18740924,
                       "description": "_type:nms, product of:",
                       "details": [
                          {
                             "value": 1,
                             "description": "boost",
                             "details": []
                          },
                          {
                             "value": 0.18740924,
                             "description": "queryNorm",
                             "details": []
                          }
                       ]
                    }
                 ]
              }
           ]
        }
     },