Question

I'm trying to set up an index correctly on Elasticsearch (v5.2.0) and take advantage of lemmatization. My index looks like this:

PUT /icu 
{
"settings":{
    "index":{
        "analysis":{
            "filter":{
                "latin_transform":{
                    "type":"icu_transform",
                    "id":"Any-Latin; Lower()"
                },
                "lemmagen_filter_sr":{
                    "type":"lemmagen",
                    "lexicon":"sr"
                }
            },
            "analyzer":{
                "lemmagen_lowercase_sr":{
                    "filter":[
                        "lemmagen_filter_sr",
                        "latin_transform"
                    ],
                    "type":"custom",
                    "tokenizer":"standard"
                }
            }
        }
    }
}
}

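I have installed https://github.com/vhyza/elasticsearch-analysis-lemmagen, but when I analyze text it seems that for Cyrillic input only the latin_transform filter takes effect and lemmagen_filter_sr does not, whereas for Latin input lemmagen_filter_sr lemmatizes the words correctly.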

Here is an example:

POST icu/_analyze
{  
    "analyzer":"lemmagen_lowercase_sr",
    "text":"реду раду и дисциплини redu i disciplini"
}

I get the following:

{
  "tokens": [
    {
      "token": "redu",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "radu",
      "start_offset": 5,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "i",
      "start_offset": 10,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "disciplini",
      "start_offset": 12,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "red",
      "start_offset": 23,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "i",
      "start_offset": 28,
      "end_offset": 29,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "disciplina",
      "start_offset": 30,
      "end_offset": 40,
      "type": "<ALPHANUM>",
      "position": 6
    }
  ]
}

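As you can see, the first four words were transliterated to Latin script without being lemmatized, while the last three words, which were in Latin script to begin with, were lemmatized. How can I fix this?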

Answer 1

After some experimenting I found a workaround. Instead of chaining two token filters in the analyzer, I moved the transliteration out of the latin_transform filter and into a char_filter: the text is first transliterated through a mapping, and lemmatization is applied afterwards.

The mapping is read from serbian_mapping.txt, which contains the transliteration key-value pairs, and that solves the problem.
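
For anyone trying to reproduce this, here is a minimal sketch of what such a setup could look like. It is not the exact configuration from this answer. It assumes the built-in `mapping` char filter with `mappings_path`, the char filter name `serbian_transliteration` is made up for illustration, and the choice to fold uppercase Cyrillic straight to lowercase Latin is my own assumption (uppercase Latin input is not lowercased here). The script generates the lines for serbian_mapping.txt and the index settings body, and `transliterate` only imitates in Python what the char filter would do before tokenization.

```python
import json

# Serbian Cyrillic -> Latin (Gaj's alphabet), lowercase letters.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def mapping_lines():
    """Return lines in the 'key => value' format used by mapping files."""
    lines = []
    for cyr, lat in CYR_TO_LAT.items():
        lines.append(f"{cyr} => {lat}")
        # Assumption: uppercase Cyrillic is mapped directly to lowercase Latin.
        lines.append(f"{cyr.upper()} => {lat}")
    return lines


def transliterate(text):
    """Imitate the char filter locally: replace each mapped character."""
    table = {}
    for line in mapping_lines():
        key, value = line.split(" => ")
        table[ord(key)] = value
    return text.translate(table)


def index_settings():
    """Build a settings body: transliterate in a char_filter, then lemmatize."""
    return {
        "settings": {
            "index": {
                "analysis": {
                    "char_filter": {
                        # Hypothetical name, not taken from the answer.
                        "serbian_transliteration": {
                            "type": "mapping",
                            "mappings_path": "serbian_mapping.txt",
                        }
                    },
                    "filter": {
                        "lemmagen_filter_sr": {
                            "type": "lemmagen",
                            "lexicon": "sr",
                        }
                    },
                    "analyzer": {
                        "lemmagen_lowercase_sr": {
                            "type": "custom",
                            "char_filter": ["serbian_transliteration"],
                            "tokenizer": "standard",
                            "filter": ["lemmagen_filter_sr"],
                        }
                    },
                }
            }
        }
    }


if __name__ == "__main__":
    # Write the mapping file as UTF-8, one 'key => value' pair per line.
    with open("serbian_mapping.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(mapping_lines()) + "\n")
    # Print the body that would be sent with PUT /icu.
    print(json.dumps(index_settings(), indent=2, ensure_ascii=False))
```

The generated file has to be placed where Elasticsearch can read it: `mappings_path` is resolved relative to the config directory unless an absolute path is given. Because char filters run before the tokenizer, the lemmagen filter only ever sees Latin-script tokens, which is the ordering described above. Rerunning the `_analyze` request from the question is a quick way to check the result.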
