I'm trying to implement Google-style autocomplete and autocorrect with Elasticsearch.
Mapping:
POST music
{
  "settings": {
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "nGram_filter"
          ]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "song": {
      "properties": {
        "song_field": {
          "type": "string",
          "analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer"
        },
        "suggest": {
          "type": "completion",
          "analyzer": "simple",
          "search_analyzer": "simple",
          "payloads": true
        }
      }
    }
  }
}
Documents:
POST music/song
{
  "song_field" : "beautiful queen",
  "suggest" : "beautiful queen"
}
POST music/song
{
  "song_field" : "beautiful",
  "suggest" : "beautiful"
}
When a user types "beaatiful q", I want them to get something like "beautiful queen" (beaatiful is corrected to beautiful, and q is completed to queen).
I tried the following query:
POST music/song/_search?search_type=dfs_query_then_fetch
{
  "size": 10,
  "suggest": {
    "didYouMean": {
      "text": "beaatiful q",
      "completion": {
        "field": "suggest"
      }
    }
  },
  "query": {
    "match": {
      "song_field": {
        "query": "beaatiful q",
        "fuzziness": 2
      }
    }
  }
}
Unfortunately, the completion suggester does not tolerate any typos, so I got this response:
"suggest": {
  "didYouMean": [
    {
      "text": "beaatiful q",
      "offset": 0,
      "length": 11,
      "options": []
    }
  ]
}
In addition, the search gave me these results ("beautiful" ranks higher even though the user has started typing "queen"):
"hits": [
  {
    "_index": "music",
    "_type": "song",
    "_id": "AVUj4Y5NancUpEdFLeLo",
    "_score": 0.51315063,
    "_source": {
      "song_field": "beautiful",
      "suggest": "beautiful"
    }
  },
  {
    "_index": "music",
    "_type": "song",
    "_id": "AVUj4XFAancUpEdFLeLn",
    "_score": 0.32071912,
    "_source": {
      "song_field": "beautiful queen",
      "suggest": "beautiful queen"
    }
  }
]
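UPDATE!!!
I found out that I can use fuzzy matching with the completion suggester, but now I get no suggestions at all when querying (fuzzy only supports an edit distance of up to 2):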
POST music/song/_search
{
  "size": 10,
  "suggest": {
    "didYouMean": {
      "text": "beaatefal q",
      "completion": {
        "field": "suggest",
        "fuzzy" : {
          "fuzziness" : 2
        }
      }
    }
  }
}
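A quick sanity check on the edit distances involved may help here. The sketch below is a plain Levenshtein implementation in Python, written only for illustration; it is not how Elasticsearch computes fuzziness internally (the helper name `levenshtein` is made up for this example), but it suggests that "beaatefal" is further from "beautiful" than the 2-edit limit allows, while the earlier "beaatiful" is within it.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute).

    Illustrative helper only; Elasticsearch's own fuzzy matching may differ.
    """
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(levenshtein("beaatiful", "beautiful"))  # 1
print(levenshtein("beaatefal", "beautiful"))  # 3
```

If that reading is right, the second misspelling needs three edits, which would be one possible reason the fuzzy completion query comes back empty.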
I still expect "beautiful queen" to come back as the suggestion.
Answer 0: (score: 1)
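When you want to offer 2 or more words as a search suggestion, I found out (the hard way) that using ngrams or edge ngrams in Elasticsearch isn't worth it.
Using the shingle token filter with a shingle analyzer gives you multi-word phrases, and if you combine that with match_phrase_prefix, it should give you the functionality you're looking for.
Basically, something like the following. Don't forget to set up the mapping: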
PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "my_shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "my_shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_shingle_filter"
          ]
        }
      }
    }
  }
}
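To make the match_phrase_prefix part a bit more concrete, here is a rough sketch in Python that only builds and prints two request bodies; it does not contact a cluster. The index name my_index and the analyzer my_shingle_analyzer come from the settings above. Everything else is an assumption for illustration: the song type, the song_field.shingles sub-field, the "string" field type (chosen to match the question's mapping; newer Elasticsearch versions use "text"), and the max_expansions value.

```python
import json

# Assumed mapping: a hypothetical "shingles" sub-field on song_field that is
# analyzed with my_shingle_analyzer from the index settings.
mapping_body = {
    "song": {
        "properties": {
            "song_field": {
                "type": "string",  # assumption: pre-5.x field type, as in the question
                "fields": {
                    "shingles": {
                        "type": "string",
                        "analyzer": "my_shingle_analyzer",
                    }
                },
            }
        }
    }
}

# Assumed query: phrase-prefix match against the shingled sub-field, so that
# the trailing partial word ("q") is treated as a prefix.
search_body = {
    "query": {
        "match_phrase_prefix": {
            "song_field.shingles": {
                "query": "beautiful q",
                "max_expansions": 10,  # illustrative value, not a recommendation
            }
        }
    }
}

print("PUT /my_index/_mapping/song")
print(json.dumps(mapping_body, indent=2))
print("POST /my_index/song/_search")
print(json.dumps(search_body, indent=2))
```

Note that this only covers the completion half of the problem ("q" to "queen"); match_phrase_prefix does not by itself correct a misspelling such as "beaatiful".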
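Ngrams and edge ngrams tokenize at the level of individual characters, whereas the shingle analyzer and filter group whole words together, which gives you a more efficient way to generate and search phrases. I spent a lot of time messing around with the first two until I saw shingles mentioned and read up on them. Much better.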
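The difference between the two approaches can be seen in a small standalone Python sketch. These functions are simplified stand-ins written for this example (the names `edge_ngrams` and `shingles` are made up here), not Elasticsearch's actual filters, but they show the idea: edge ngrams slice a single word into character prefixes, while shingles join consecutive words into phrases.

```python
def edge_ngrams(word, min_gram=2, max_gram=20):
    """Character prefixes of one word, in the spirit of an edge n-gram filter."""
    upper = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, upper + 1)]


def shingles(text, size=2):
    """Word-level n-grams: every run of `size` consecutive lowercase words."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


print(edge_ngrams("queen"))            # ['qu', 'que', 'quee', 'queen']
print(shingles("My beautiful queen"))  # ['my beautiful', 'beautiful queen']
print(shingles("beautiful"))           # [] (no two-word phrase to build)
```

With a shingle size of 2 and no unigrams, as in the settings above, a single-word title yields no shingles in this sketch, so it may be worth keeping a regular field alongside the shingled one.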