Elasticsearch - fuzziness with operator AND not working as expected

Asked: 2018-10-25 07:11:37

Tags: java elasticsearch

In my Elasticsearch index I have the documents below (shown here as a search response):

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.9589403,
    "hits": [
      {
        "_index": "productcatalog",
        "_type": "doc",
        "_id": "1",
        "_score": 0.9589403,
        "_source": {
          "catalog_id": "343",
          "catalog_type": "series",
          "values": "Activa Rooftop, valves, VG3000, VG3000FS, butterfly, ball"
        }
      },
      {
        "_index": "productcatalog",
        "_type": "doc",
        "_id": "2",
        "_score": 0.6712582,
        "_source": {
          "catalog_id": "12717",
          "catalog_type": "product",
          "values": "Activa Rooftop, valves"
        }
      }
    ]
  }
}

I am firing the API query below to search for Activa Rooftop ball, and I expect the response to contain only the single document whose values include both Activa Rooftop and ball.

GET productcatalog/_search
{
    "query": {
        "match" : {
            "values" : {
                "query" : " activa rooftp ball ",
                "operator" : "and",
                "boost": 1.0,
                "fuzziness": 2,
                "prefix_length": 0,
                "max_expansions": 100


            }
        }
    }
}

However, I am getting both documents back in the response.

Please find my mapping below:

PUT productcatalog
{  
   "settings":{  
      "analysis":{  
         "analyzer":{  
            "attr_analyzer":{  
               "type":"custom",
               "tokenizer":"letter",
               "char_filter":[  
                  "html_strip"
               ],
               "filter":[  
                  "lowercase",
                  "asciifolding",
                  "stemmer_minimal_english",
                  "stemmer_minimal_german",
                  "stemmer_minimal_french",
                  "stemmer_minimal_norwegian",
                  "stemmer_minimal_portuguese"
               ]
            }
         },
         "filter":{  
            "stemmer_minimal_english":{  
               "type":"stemmer",
               "name":"minimal_english"
            },
            "stemmer_minimal_german":{  
               "type":"stemmer",
               "name":"minimal_german"
            },
            "stemmer_minimal_french":{  
               "type":"stemmer",
               "name":"minimal_french"
            },
            "stemmer_minimal_norwegian":{  
               "type":"stemmer",
               "name":"minimal_norwegian"
            },
            "stemmer_minimal_portuguese":{  
               "type":"stemmer",
               "name":"minimal_portuguese"
            }
         }
      }
   },
   "mappings":{  
      "doc":{  
         "properties":{  
            "values":{  
               "type":"text",
               "analyzer":"attr_analyzer"
            },
            "catalog_type":{  
               "type":"text"
            },
            "catalog_id":{  
               "type":"long"
            }
         }
      }
   }
}

I am using version 6.2.3. Also, please find my Java API code for the same fuzzy query below.

 QueryBuilder qb = QueryBuilders.matchQuery("values", keyword).operator(Operator.AND).boost(1.0f).fuzziness(2).prefixLength(0).maxExpansions(100);   

1 Answer:

Answer (score: 3):

The problem you are facing here is related to stemming. I have analyzed your attr_analyzer analyzer; please take a look below.

First test:

GET index-52983383/_analyze 
{
  "analyzer": "attr_analyzer", 
  "text":     "Activa Rooftop, valves, VG3000, VG3000FS, butterfly, ball"
}

Response:

{
  "tokens": [
    {
      "token": "activ",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "rooftop",
      "start_offset": 7,
      "end_offset": 14,
      "type": "word",
      "position": 1
    },
    {
      "token": "valv",
      "start_offset": 16,
      "end_offset": 22,
      "type": "word",
      "position": 2
    },
    {
      "token": "vg",
      "start_offset": 24,
      "end_offset": 26,
      "type": "word",
      "position": 3
    },
    {
      "token": "vg",
      "start_offset": 32,
      "end_offset": 34,
      "type": "word",
      "position": 4
    },
    {
      "token": "fs",
      "start_offset": 38,
      "end_offset": 40,
      "type": "word",
      "position": 5
    },
    {
      "token": "butterfly",
      "start_offset": 42,
      "end_offset": 51,
      "type": "word",
      "position": 6
    },
    {
      "token": "ball",
      "start_offset": 53,
      "end_offset": 57,
      "type": "word",
      "position": 7
    }
  ]
}

Second test:

GET index-52983383/_analyze 
{
  "analyzer": "attr_analyzer", 
  "text":     "Activa Rooftop, valves"
}

Response:

{
  "tokens": [
    {
      "token": "activ",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "rooftop",
      "start_offset": 7,
      "end_offset": 14,
      "type": "word",
      "position": 1
    },
    {
      "token": "valv",
      "start_offset": 16,
      "end_offset": 22,
      "type": "word",
      "position": 2
    }
  ]
}

As you can see, both responses contain the token valv. The Levenshtein distance between valv and the ball from your search term is exactly 2, which equals your fuzziness parameter, so ball fuzzily matches valv and both documents are returned.
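
To make the arithmetic concrete, here is a minimal, self-contained sketch (plain Java, independent of Elasticsearch, whose fuzzy matching additionally allows transpositions, which does not change this example) that computes the edit distance between the stemmed token valv and the search term ball:

public class LevenshteinCheck {

    // Classic dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // "valv" -> "balv" -> "ball": two substitutions, so the distance is 2,
        // which is within fuzziness: 2 and therefore counts as a match.
        System.out.println(levenshtein("valv", "ball")); // prints 2
    }
}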

When using fuzziness you usually have to compromise somewhere, otherwise you will keep running into situations like this. Perhaps consider using the AUTO value instead of 2 for fuzziness? If you are not sure what I am talking about, take a look at the documentation. Another option is to set prefix_length to at least 1, so that the first character always has to match exactly. You will need to run the same kind of tests and then decide which approach works best for your data.
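
Applied to the Java API code from the question, those suggestions could look like the sketch below (assuming the same keyword variable as in your snippet; Fuzziness.AUTO and prefixLength are part of the Elasticsearch 6.x Java API, and you can apply either change on its own):

import org.elasticsearch.common.unit.Fuzziness;
import org.elasticsearch.index.query.Operator;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// AUTO scales the allowed edit distance with the term length (only 1 edit for a
// 4-character term like "ball"), and prefix_length = 1 forces the first character
// to match exactly, so "ball" no longer expands to "valv".
QueryBuilder qb = QueryBuilders.matchQuery("values", keyword)
        .operator(Operator.AND)
        .boost(1.0f)
        .fuzziness(Fuzziness.AUTO)
        .prefixLength(1)
        .maxExpansions(100);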