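I am trying to build an efficient autocomplete search input on my website for looking up cities. I assume people will always type their city name in the correct word order.
For example, a user living in Saint-Maur will type sai.., but will never type mau.. first.
I need to boost a result's score if it begins with the terms in the query. For example, if the user types pari, the city Parigné-le-Pôlin should score higher than Fontenay-en-Parisis, because it starts with pari.
I used an edge n-gram filter and a phrase match, because word order matters. I am sure there is a simple solution to my problem, but I am new to the magic world of ES :)
Here is my mapping: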
{
  "settings": {
    "index": {
      "number_of_shards": 1
    },
    "analysis": {
      "analyzer": {
        "partialPostalCodeAnalyzer": {
          "tokenizer": "standard",
          "filter": ["partialFilter"]
        },
        "partialNameAnalyzer": {
          "tokenizer": "standard",
          "filter": ["asciifolding", "lowercase", "word_delimiter", "partialFilter"]
        },
        "searchAnalyzer": {
          "tokenizer": "standard",
          "filter": ["asciifolding", "lowercase", "word_delimiter"]
        }
      },
      "filter": {
        "partialFilter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 50
        }
      }
    }
  },
  "mappings": {
    "village": {
      "properties": {
        "postalCode": {
          "type": "string",
          "index_analyzer": "partialPostalCodeAnalyzer",
          "search_analyzer": "searchAnalyzer"
        },
        "name": {
          "type": "string",
          "index_analyzer": "partialNameAnalyzer",
          "search_analyzer": "searchAnalyzer"
        },
        "population": {
          "type": "integer",
          "index": "not_analyzed"
        }
      }
    }
  }
}
Some sample documents:
PUT /tv_village/village/1 {"name": "Paris"}
PUT /tv_village/village/2 {"name": "Parigny"}
PUT /tv_village/village/3 {"name": "Fontenay-en-Parisis"}
PUT /tv_village/village/4 {"name": "Parigné-le-Pôlin"}
If I run this query, you can see that the results are not in the order I want (I would like document 4 to rank before document 3):
GET /tv_village/village/_search
{
  "query": {
    "match_phrase": {
      "name": "pari"
    }
  }
}
Results:
"hits": [
  {
    "_index": "tv_village",
    "_type": "village",
    "_id": "1",
    "_score": 0.7768564,
    "_source": {
      "name": "Paris"
    }
  },
  {
    "_index": "tv_village",
    "_type": "village",
    "_id": "2",
    "_score": 0.7768564,
    "_source": {
      "name": "Parigny"
    }
  },
  {
    "_index": "tv_village",
    "_type": "village",
    "_id": "3",
    "_score": 0.3884282,
    "_source": {
      "name": "Fontenay-en-Parisis"
    }
  },
  {
    "_index": "tv_village",
    "_type": "village",
    "_id": "4",
    "_score": 0.3884282,
    "_source": {
      "name": "Parigné-le-Pôlin"
    }
  }
]
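Answer 0 (score: 0)
In your mapping definition, add another analyzer:
"keywordLowercaseAnalyer": {
  "tokenizer": "keyword",
  "filter": ["lowercase"]
}
This keeps the whole name as a single token (via the keyword tokenizer) and lowercases it (for example "parigné-le-pôlin").
Then define two additional sub-fields for your name field: raw, which should be not_analyzed, and raw_lowercase, which should use keywordLowercaseAnalyer: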
"name": {
  "type": "string",
  "index_analyzer": "partialNameAnalyzer",
  "search_analyzer": "searchAnalyzer",
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    },
    "raw_lowercase": {
      "type": "string",
      "analyzer": "keywordLowercaseAnalyer"
    }
  }
}
I did this so that you can search for either "pari" or "Pari". In your query, use the rescore feature to recompute the score based on an additional query:
{
  "query": {
    "match_phrase": {
      "name": "pari"
    }
  },
  "rescore": {
    "query": {
      "rescore_query": {
        "bool": {
          "should": [
            {"prefix": {"name.raw": "pari"}},
            {"prefix": {"name.raw_lowercase": "pari"}}
          ]
        }
      }
    }
  }
}
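From the perspective of your use case, there are two points to note about the prefix query:
First, the value passed to prefix is not analyzed, which is why I added the two raw* fields: one handles the lowercased version and the other the untouched version, so a query for either "pari" or "Pari" is covered.
Second, I suggest using the window_size attribute of the rescore query to limit the number of documents on which rescoring is performed, which improves performance.
For your reference, see the documentation page for rescore.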
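As a rough sketch of the window_size suggestion, the rescore section of the query could be extended as shown here. The value 50 is only an illustrative assumption and not a figure from the answer; window_size applies per shard, so it is worth tuning to roughly the number of top hits the autocomplete actually displays.

```json
{
  "query": {
    "match_phrase": {
      "name": "pari"
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "bool": {
          "should": [
            {"prefix": {"name.raw": "pari"}},
            {"prefix": {"name.raw_lowercase": "pari"}}
          ]
        }
      }
    }
  }
}
```

With this in place, only the top window_size hits of the match_phrase query on each shard are rescored by the prefix clauses; hits outside that window keep their original score.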