Below are the settings I use for fuzzy search across institutes:
{
    "analysis": {
        "filter": {
            "edgeNGramFilter": {
                "type": "nGram",
                "min_gram": 1,
                "max_gram": 20
            },
            "institutes_stopwords": {
                "type": "stop",
                "stopwords": ["College", "University", "Engineering", "of", "Institute", "Technology"]
            },
            "word_joiner": {
                "type": "word_delimiter",
                "catenate_all": true
            },
            "specialchars_remover": {
                "type": "pattern_replace",
                "pattern": "[^A-Za-z0-9]",
                "replacement": " "
            }
        },
        "analyzer": {
            "whitespaceAnalyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "institutes_stopwords"
                ]
            },
            "edgeNGramAnalyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "edgeNGramFilter",
                    "institutes_stopwords",
                    "word_joiner",
                    "specialchars_remover"
                ]
            }
        }
    }
}
The mapping is:
"list": {
    "properties": {
        "id": {
            "type": "string",
            "index": "not_analyzed"
        },
        "s_no": {
            "type": "string",
            "index": "not_analyzed"
        },
        "institute": {
            "type": "multi_field",
            "fields": {
                "institute": {
                    "type": "string",
                    "analyzer": "standard",
                    "index_analyzer": "standard",
                    "search_analyzer": "standard",
                    "filter": "word_joiner",
                    "boost": 10.0
                },
                "partial": {
                    "type": "string",
                    "analyzer": "edgeNGramAnalyzer",
                    "index_analyzer": "standard",
                    "filter": "word_joiner",
                    "search_analyzer": "edgeNGramAnalyzer",
                    "boost": 1.0
                }
            }
        }
    }
}
So, when I search for a college name using the following query,
{
    "query": {
        "match": {
            "institute": {
                "query": "A V C College of Engg",
                "fuzziness": 3,
                "minimum_should_match": "-40%",
                "boost": 5
            }
        }
    }
}
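it works well for institutes whose names are completely different, but for closely related names such as MIT there are false positives: "VIT College" and similar entries come up as the top results.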
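To illustrate why I suspect the short acronyms collide, here is a minimal Python sketch of plain Levenshtein distance. It is only my own illustration and assumes that `fuzziness` acts as a maximum edit distance per term; it is not how Elasticsearch actually scores documents, and the `levenshtein` helper is a name I made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


print(levenshtein("mit", "vit"))           # 1
print(levenshtein("avc", "mvc"))           # 1
print(levenshtein("engg", "engineering"))  # 7
```

Under that assumption, two three-letter acronyms can never be more than 3 edits apart, so an edit-distance budget of 3 cannot tell them apart, while "engg" is 7 edits away from "engineering" and would not be matched by fuzziness alone.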
Other scenarios include:
* MVC Engineering College is the same as MVC Engg College
* MVC Engineering College is the same as M.V.C Engineering College
* MVC Engineering College is the same as M V C Engineering College
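To make the equivalences in the list above concrete, this is roughly the normalization I am after, sketched in plain Python rather than as an Elasticsearch analyzer. The `normalize` function and the `ABBREVIATIONS` table are hypothetical names of my own; only the "Engg" abbreviation comes from the examples above.

```python
import re

# Hypothetical abbreviation table; only "engg" is taken from the examples.
ABBREVIATIONS = {"engg": "engineering"}


def normalize(name: str) -> str:
    """Reduce an institute name to a canonical form for comparison."""
    # Drop periods so "M.V.C" becomes "MVC", then lowercase everything.
    text = name.replace(".", "").lower()
    tokens = re.findall(r"[a-z0-9]+", text)

    result = []
    acronym = ""
    for token in tokens:
        if len(token) == 1 and token.isalpha():
            # Collect runs of single letters ("m", "v", "c") into one acronym.
            acronym += token
            continue
        if acronym:
            result.append(acronym)
            acronym = ""
        # Expand known abbreviations, leave other tokens unchanged.
        result.append(ABBREVIATIONS.get(token, token))
    if acronym:
        result.append(acronym)
    return " ".join(result)


variants = [
    "MVC Engg College",
    "M.V.C Engineering College",
    "M V C Engineering College",
]
for variant in variants:
    print(normalize(variant))
```

All three variants should reduce to the same string as "MVC Engineering College", which is the behavior I would like the analyzers to reproduce.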
Should I make any changes to the settings, or does the query need any corrections?