How do I correctly set up an index on Elasticsearch (5.2.0) to use transliteration and lemmatization?

Time: 2018-04-30 16:03:32

Tags: elasticsearch transliteration lemmatization

I'm trying to set up an index correctly on Elasticsearch (v5.2.0) so that I can take advantage of lemmatization. My index looks like this:

PUT /icu 
{
"settings":{
    "index":{
        "analysis":{
            "filter":{
                "latin_transform":{
                    "type":"icu_transform",
                    "id":"Any-Latin; Lower()"
                },
                "lemmagen_filter_sr":{
                    "type":"lemmagen",
                    "lexicon":"sr"
                }
            },
            "analyzer":{
                "lemmagen_lowercase_sr":{
                    "filter":[
                        "lemmagen_filter_sr",
                        "latin_transform"
                    ],
                    "type":"custom",
                    "tokenizer":"standard"
                }
            }
        }
    }
}
}

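I have installed https://github.com/vhyza/elasticsearch-analysis-lemmagen, but when I analyze some text, it seems that for Cyrillic input only the latin_transform filter takes effect and lemmagen_filter_sr does not, while for Latin input lemmagen_filter_sr lemmatizes the words correctly.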

Here is an example:

POST icu/_analyze
{  
    "analyzer":"lemmagen_lowercase_sr",
    "text":"реду раду и дисциплини redu i disciplini"
}

I get the following:

{
  "tokens": [
    {
      "token": "redu",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "radu",
      "start_offset": 5,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "i",
      "start_offset": 10,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "disciplini",
      "start_offset": 12,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "red",
      "start_offset": 23,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "i",
      "start_offset": 28,
      "end_offset": 29,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "disciplina",
      "start_offset": 30,
      "end_offset": 40,
      "type": "<ALPHANUM>",
      "position": 6
    }
  ]
}

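As you can see, the first four words were transliterated to Latin script without being lemmatized, while the last three words, which were in Latin script to begin with, were lemmatized. How can I fix this?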
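
Token filters run in the order they are listed in the analyzer, so in the definition above lemmagen_filter_sr sees each token before latin_transform does. If the sr lexicon only recognizes Latin-script forms (an assumption, not something confirmed here), that ordering alone would explain the output. A quick way to test this without recreating the index is to pass the tokenizer and filters explicitly to _analyze, with the transliteration listed first:

```
# Sketch: reuses the filters already defined on the icu index, in swapped order
POST icu/_analyze
{
  "tokenizer": "standard",
  "filter": ["latin_transform", "lemmagen_filter_sr"],
  "text": "реду раду и дисциплини"
}
```

If the Cyrillic words now come back lemmatized, swapping the two entries in the analyzer's filter array may be enough.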

1 Answer:

Answer 0: (score: 0)

After some experimenting, I found a workaround. Instead of using two token filters in the analyzer, I moved the transliteration into a char_filter. In the new analyzer, a mapping char_filter does the transliteration first, and the lemmagen filter then runs on the transliterated text.

The file serbian_mapping.txt contains the transliteration key-value pairs, and with it the problem is solved.
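
A minimal sketch of what such an analyzer could look like, not necessarily the exact configuration used in this answer. The index name icu_charfilter and the char_filter name serbian_transliteration are hypothetical, and the lowercase token filter is an assumption added to stand in for the Lower() step of the original latin_transform. The mappings_path value is resolved relative to the Elasticsearch config directory, and each line of the file is expected to have the form `д => d`.

```
# Sketch only: index name and char_filter name are hypothetical
PUT /icu_charfilter
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "serbian_transliteration": {
            "type": "mapping",
            "mappings_path": "serbian_mapping.txt"
          }
        },
        "filter": {
          "lemmagen_filter_sr": {
            "type": "lemmagen",
            "lexicon": "sr"
          }
        },
        "analyzer": {
          "lemmagen_lowercase_sr": {
            "type": "custom",
            "char_filter": ["serbian_transliteration"],
            "tokenizer": "standard",
            "filter": ["lowercase", "lemmagen_filter_sr"]
          }
        }
      }
    }
  }
}
```

Because a char_filter runs on the raw text before tokenization, the mapping file would need entries for both uppercase and lowercase Cyrillic letters if mixed-case input is expected.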