How do I correctly handle multi-word synonym expansion with Elasticsearch?

Time: 2019-05-01 23:53:13

Tags: elasticsearch elastic-stack elasticsearch-5

I have the following synonym expansion:

suco => suco, refresco, bebida de soja

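I want searches to be tokenized like this:

A search for "suco de laranja" should be tokenized as ["suco", "laranja", "refresco", "bebida de soja"].

Instead, I get ["suco", "laranja", "refresco", "bebida", "soja"].

Note that "de" is a stop word. I want it to be ignored in queries such as "bebida de laranja", which should become ["bebida", "laranja"]. But I do not want it to be treated as a stop word when the synonym is tokenized, so that "bebida de soja" remains the single token "bebida de soja".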

My settings:

{
    "settings":{
        "analysis":{
            "filter":{
                "synonym_br":{
                    "type":"synonym",
                    "synonyms":[
                        "suco => suco, refresco, bebida de soja"
                    ]
                },
                "brazilian_stop":{
                    "type":"stop",
                    "stopwords":"_brazilian_"
                }
            },
            "analyzer":{
                "synonyms":{
                    "filter":[
                        "synonym_br",
                        "lowercase",
                        "brazilian_stop",
                        "asciifolding"
                    ],
                    "type":"custom",
                    "tokenizer":"standard"
                }
            }
        }
    }
}

1 answer:

Answer 0 (score: 0)

I suggest the following two changes. The first relates directly to your question; the second is a recommendation.

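  1. Do the opposite: instead of expanding a single word into several synonyms, map all of the synonyms onto a single word. Note that a synonym does not have to be a single word; it can be a multi-word phrase. So change "suco => suco, refresco, bebida de soja" to "suco, refresco, bebida de soja => suco".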

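  2. Change the order of the filters in the synonyms analyzer: put lowercase before synonym_br. This ensures that letter case does not affect the synonym_br token filter.

So the final settings would be: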

{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_br": {
          "type": "synonym",
          "synonyms": [
            "suco, refresco, bebida de soja => suco"
          ]
        },
        "brazilian_stop": {
          "type": "stop",
          "stopwords": "_brazilian_"
        }
      },
      "analyzer": {
        "synonyms": {
          "filter": [
            "lowercase",
            "synonym_br",
            "brazilian_stop",
            "asciifolding"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}

How does this work?

For the input bebida de soja, the filters are applied in the following order:

Filter              Result tokens
====================================
lowercase           bebida, de, soja
synonym_br          suco             <------- all of the tokens above (positions included) exactly match a synonym
brazilian_stop      suco
asciifolding        suco

Now let's see what brazilian_stop actually does. For that we need an input that does not match any synonym but does contain de, for example: de soja

Filter              Result tokens
=================================
lowercase           de, soja
synonym_br          de, soja  <------- none of the tokens, alone or combined (positions included), matches any synonym
brazilian_stop      soja      <------- de is removed because it is a stop word
asciifolding        soja
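
The traces above can be checked against a real cluster with the _analyze API. This is a minimal sketch of the request body, assuming the final settings were used to create an index (the index name my_index is hypothetical, not from the question) and that the body is sent to my_index/_analyze:

```json
{
  "analyzer": "synonyms",
  "text": "bebida de soja"
}
```

The response lists each emitted token together with its position and offsets. Replacing the text with de soja exercises the stop-word case instead.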