I have the following synonym expansion:
suco => suco, refresco, bebida de soja
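I want searches to be tokenized like this:
A search for "suco de laranja" should be tokenized as ["suco", "laranja", "refresco", "bebida de soja"].
But I am getting ["suco", "laranja", "refresco", "bebida", "soja"] instead.
Note that "de" is a stopword. I want it to be ignored in queries such as "bebida de laranja", which should become ["bebida", "laranja"]. But I do not want it to be taken into account when tokenizing synonyms, so that "bebida de soja" remains a single token, "bebida de soja".
My settings: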
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_br": {
          "type": "synonym",
          "synonyms": [
            "suco => suco, refresco, bebida de soja"
          ]
        },
        "brazilian_stop": {
          "type": "stop",
          "stopwords": "_brazilian_"
        }
      },
      "analyzer": {
        "synonyms": {
          "filter": [
            "synonym_br",
            "lowercase",
            "brazilian_stop",
            "asciifolding"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
Answer 0 (score: 0)
I suggest the following two changes. The first relates directly to your question; the second is a recommendation.
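First, do the opposite: instead of expanding a single word into multiple synonyms, map all of the synonyms to a single word. Note that a synonym does not have to be a single word; it can be any combination of characters, including a multi-word phrase.
So, change "suco => suco, refresco, bebida de soja" to "suco, refresco, bebida de soja => suco".
Second, change the order of the filters in the synonyms analyzer: put lowercase before synonym_br. This ensures that letter case does not affect the synonym_br token filter.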
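As a rough illustration of what reversing the rule changes, the sketch below reads both rules into a plain lookup table. The parse_rule helper is hypothetical and is not how Elasticsearch parses synonyms internally; it only mirrors the "inputs => outputs" reading of the Solr-style rule format. The last two lines hint at why lowercase should run first: a lookup like this is case-sensitive.

```python
def parse_rule(rule):
    """Read one 'a, b => c, d' rule as {input phrase: [output phrases]}.

    Hypothetical helper for illustration only, not an Elasticsearch API.
    """
    left, right = rule.split("=>")
    inputs = [phrase.strip() for phrase in left.split(",")]
    outputs = [phrase.strip() for phrase in right.split(",")]
    return {phrase: outputs for phrase in inputs}


# Original rule: one input that expands to three outputs.
original = parse_rule("suco => suco, refresco, bebida de soja")

# Reversed rule: three inputs that all collapse to the single output "suco".
reversed_rule = parse_rule("suco, refresco, bebida de soja => suco")

# The lookup is case-sensitive, so lowercasing has to happen before it.
matches_raw = "Suco" in reversed_rule
matches_lowercased = "Suco".lower() in reversed_rule
```

With the reversed rule, the multi-word phrase "bebida de soja" sits on the input side, so it is matched as a whole and replaced by one token rather than being emitted as several tokens.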
So the final settings would be:
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_br": {
          "type": "synonym",
          "synonyms": [
            "suco, refresco, bebida de soja => suco"
          ]
        },
        "brazilian_stop": {
          "type": "stop",
          "stopwords": "_brazilian_"
        }
      },
      "analyzer": {
        "synonyms": {
          "filter": [
            "lowercase",
            "synonym_br",
            "brazilian_stop",
            "asciifolding"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
For the input bebida de soja, the filters are applied in the following order:
Filter            Result tokens
====================================
lowercase         bebida, de, soja
synonym_br        suco               <------- all of the tokens above (including their positions) exactly match a synonym
brazilian_stop    suco
asciifolding      suco
Now let's see brazilian_stop in action. For that we need an input that does not match any synonym but does contain de, for example de soja:
Filter            Result tokens
=================================
lowercase         de, soja
synonym_br        de, soja           <------- none of the tokens (individually or combined, including their positions) match any synonym
brazilian_stop    soja               <------- de is removed because it is a stopword
asciifolding      soja
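The two walkthroughs above can be reproduced with a small simulation of the filter chain. This is only a sketch under stated assumptions: whitespace splitting stands in for the standard tokenizer, STOPWORDS is a one-word stand-in for the _brazilian_ list, and the synonym step is a simple longest-match lookup rather than Elasticsearch's actual implementation. All function and variable names are made up for the example.

```python
import unicodedata

# Assumption: a tiny stand-in for the built-in _brazilian_ stopword list.
STOPWORDS = {"de"}

# The rule "suco, refresco, bebida de soja => suco" as token sequences.
SYNONYMS = {
    ("suco",): ["suco"],
    ("refresco",): ["suco"],
    ("bebida", "de", "soja"): ["suco"],
}


def lowercase(tokens):
    return [t.lower() for t in tokens]


def synonym_br(tokens):
    """Replace the longest matching token sequence with its synonym."""
    longest = max(len(key) for key in SYNONYMS)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in SYNONYMS:
                out.extend(SYNONYMS[key])
                i += n
                break
        else:
            # No rule starts at this position: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def brazilian_stop(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def asciifolding(tokens):
    """Strip combining accents, e.g. 'maçã' becomes 'maca'."""
    return [
        "".join(c for c in unicodedata.normalize("NFKD", t)
                if not unicodedata.combining(c))
        for t in tokens
    ]


def analyze(text):
    tokens = text.split()  # stand-in for the standard tokenizer
    for token_filter in (lowercase, synonym_br, brazilian_stop, asciifolding):
        tokens = token_filter(tokens)
    return tokens


print(analyze("bebida de soja"))     # ['suco']
print(analyze("de soja"))            # ['soja']
print(analyze("Bebida de Laranja"))  # ['bebida', 'laranja']
```

Because the synonym step runs before the stop filter, "de" is still present when "bebida de soja" is matched as a whole; in inputs that match no rule, the stop filter removes it afterwards. On a real index, the _analyze API is the way to confirm the actual token output.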