In my Elasticsearch index, I have the following documents:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.9589403,
"hits": [
{
"_index": "productcatalog",
"_type": "doc",
"_id": "1",
"_score": 0.9589403,
"_source": {
"catalog_id": "343",
"catalog_type": "series",
"values": "Activa Rooftop, valves, VG3000, VG3000FS, butterfly, ball"
}
},
{
"_index": "productcatalog",
"_type": "doc",
"_id": "2",
"_score": 0.6712582,
"_source": {
"catalog_id": "12717",
"catalog_type": "product",
"values": "Activa Rooftop, valves"
}
}
]
}
}
I am firing the API query below to search for Activa Rooftop ball, and I expect the response to contain only the one document whose values contain both Activa Rooftop and ball.
GET productcatalog/_search
{
"query": {
"match" : {
"values" : {
"query" : " activa rooftp ball ",
"operator" : "and",
"boost": 1.0,
"fuzziness": 2,
"prefix_length": 0,
"max_expansions": 100
}
}
}
}
However, I am getting both documents in the response.
Please find my mapping file below:
PUT productcatalog
{
"settings":{
"analysis":{
"analyzer":{
"attr_analyzer":{
"type":"custom",
"tokenizer":"letter",
"char_filter":[
"html_strip"
],
"filter":[
"lowercase",
"asciifolding",
"stemmer_minimal_english",
"stemmer_minimal_german",
"stemmer_minimal_french",
"stemmer_minimal_norwegian",
"stemmer_minimal_portuguese"
]
}
},
"filter":{
"stemmer_minimal_english":{
"type":"stemmer",
"name":"minimal_english"
},
"stemmer_minimal_german":{
"type":"stemmer",
"name":"minimal_german"
},
"stemmer_minimal_french":{
"type":"stemmer",
"name":"minimal_french"
},
"stemmer_minimal_norwegian":{
"type":"stemmer",
"name":"minimal_norwegian"
},
"stemmer_minimal_portuguese":{
"type":"stemmer",
"name":"minimal_portuguese"
}
}
}
},
"mappings":{
"doc":{
"properties":{
"values":{
"type":"text",
"analyzer":"attr_analyzer"
},
"catalog_type":{
"type":"text"
},
"catalog_id":{
"type":"long"
}
}
}
}
}
I am using version 6.2.3. Also, please find below the Java API code I use for the same fuzzy query.
QueryBuilder qb = QueryBuilders.matchQuery("values", keyword).operator(Operator.AND).boost(1.0f).fuzziness(2).prefixLength(0).maxExpansions(100);
Answer 0 (score: 3)
The problem you are running into here is related to stemming. I ran your sample values through your attr_analyzer analyzer. Take a look below.
First test:
GET index-52983383/_analyze
{
"analyzer": "attr_analyzer",
"text": "Activa Rooftop, valves, VG3000, VG3000FS, butterfly, ball"
}
Response:
{
"tokens": [
{
"token": "activ",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 0
},
{
"token": "rooftop",
"start_offset": 7,
"end_offset": 14,
"type": "word",
"position": 1
},
{
"token": "valv",
"start_offset": 16,
"end_offset": 22,
"type": "word",
"position": 2
},
{
"token": "vg",
"start_offset": 24,
"end_offset": 26,
"type": "word",
"position": 3
},
{
"token": "vg",
"start_offset": 32,
"end_offset": 34,
"type": "word",
"position": 4
},
{
"token": "fs",
"start_offset": 38,
"end_offset": 40,
"type": "word",
"position": 5
},
{
"token": "butterfly",
"start_offset": 42,
"end_offset": 51,
"type": "word",
"position": 6
},
{
"token": "ball",
"start_offset": 53,
"end_offset": 57,
"type": "word",
"position": 7
}
]
}
Second test:
GET index-52983383/_analyze
{
"analyzer": "attr_analyzer",
"text": "Activa Rooftop, valves"
}
Response:
{
"tokens": [
{
"token": "activ",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 0
},
{
"token": "rooftop",
"start_offset": 7,
"end_offset": 14,
"type": "word",
"position": 1
},
{
"token": "valv",
"start_offset": 16,
"end_offset": 22,
"type": "word",
"position": 2
}
]
}
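As you can see, both responses contain the token valv. The Levenshtein distance between valv and the ball in your search term is 2 (two substitutions: v to b and v to l), which is exactly the value of your fuzziness parameter. So in the second document activ matches exactly, rooftp fuzzily matches rooftop, and ball fuzzily matches valv, which satisfies the and operator.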
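To check the distances yourself, here is a minimal sketch of a plain Levenshtein distance in Java. It is not Elasticsearch's implementation (by default Elasticsearch also counts a transposition of two adjacent characters as a single edit, which does not matter for these terms), and the class name LevenshteinSketch is only an illustration.

```java
public class LevenshteinSketch {

    // Plain dynamic-programming Levenshtein distance (insert, delete, substitute).
    // Illustrative only; this is not the code Elasticsearch/Lucene runs.
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("ball", "valv"));      // prints 2
        System.out.println(distance("rooftp", "rooftop")); // prints 1
        System.out.println(distance("activ", "activ"));    // prints 0
    }
}
```

With fuzziness set to 2, every query term above is within range of a token in the second document, which is why it is returned.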
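When you use fuzziness, you usually have to compromise somewhere, and you will run into similar situations with other terms. Consider using AUTO instead of 2 as the fuzziness value. With AUTO the allowed edit distance depends on the term length: by default, terms of 1 or 2 characters must match exactly, terms of 3 to 5 characters allow one edit, and longer terms allow two. ball would then no longer match valv, while rooftp would still match rooftop. If this is unfamiliar to you, take a look at the documentation. Another option is to set prefix_length to at least 1, so that the first character always has to match. Run the same tests with each option and decide which approach suits you best.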
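To see how the two suggestions would treat your terms, here is a small Java sketch that models the default AUTO length thresholds and a prefix_length check. It only mirrors the rules as described in the documentation and does not call Elasticsearch; the class and method names (FuzzinessAutoSketch, autoMaxEdits, sharesPrefix) are hypothetical.

```java
public class FuzzinessAutoSketch {

    // Models the default AUTO thresholds (AUTO:3,6) as documented:
    // length 0..2 -> exact match, 3..5 -> one edit, longer -> two edits.
    public static int autoMaxEdits(int termLength) {
        if (termLength <= 2) {
            return 0;
        }
        if (termLength <= 5) {
            return 1;
        }
        return 2;
    }

    // Models prefix_length: the first prefixLength characters must be identical.
    public static boolean sharesPrefix(String queryTerm, String indexedTerm, int prefixLength) {
        if (queryTerm.length() < prefixLength || indexedTerm.length() < prefixLength) {
            return false;
        }
        return queryTerm.regionMatches(0, indexedTerm, 0, prefixLength);
    }

    public static void main(String[] args) {
        // "ball" has 4 characters, so AUTO allows 1 edit; valv is 2 edits away.
        System.out.println(autoMaxEdits("ball".length()));   // prints 1
        // "rooftp" has 6 characters, so AUTO allows 2 edits; rooftop is 1 edit away.
        System.out.println(autoMaxEdits("rooftp".length())); // prints 2

        // With prefix_length 1, ball and valv differ in the first character.
        System.out.println(sharesPrefix("ball", "valv", 1));      // prints false
        System.out.println(sharesPrefix("rooftp", "rooftop", 1)); // prints true
    }
}
```

Either change rules out the ball/valv match for this query while keeping the rooftp typo tolerated, but both narrow what fuzzy matching accepts elsewhere, so test them against your real search terms.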