I am trying to set up an index on Elasticsearch (v5.2.0) correctly so that I can take advantage of lemmatization. My index looks like this:
PUT /icu
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "latin_transform": {
            "type": "icu_transform",
            "id": "Any-Latin; Lower()"
          },
          "lemmagen_filter_sr": {
            "type": "lemmagen",
            "lexicon": "sr"
          }
        },
        "analyzer": {
          "lemmagen_lowercase_sr": {
            "filter": [
              "lemmagen_filter_sr",
              "latin_transform"
            ],
            "type": "custom",
            "tokenizer": "standard"
          }
        }
      }
    }
  }
}
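I have installed https://github.com/vhyza/elasticsearch-analysis-lemmagen, but when I try to analyze some text, it looks as if Cyrillic text only goes through the latin_transform filter and lemmagen_filter_sr has no effect, whereas for Latin text lemmagen_filter_sr lemmatizes the words correctly.
Here is an example: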
POST icu/_analyze
{
  "analyzer": "lemmagen_lowercase_sr",
  "text": "реду раду и дисциплини redu i disciplini"
}
I get the following:
{
  "tokens": [
    { "token": "redu", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 0 },
    { "token": "radu", "start_offset": 5, "end_offset": 9, "type": "<ALPHANUM>", "position": 1 },
    { "token": "i", "start_offset": 10, "end_offset": 11, "type": "<ALPHANUM>", "position": 2 },
    { "token": "disciplini", "start_offset": 12, "end_offset": 22, "type": "<ALPHANUM>", "position": 3 },
    { "token": "red", "start_offset": 23, "end_offset": 27, "type": "<ALPHANUM>", "position": 4 },
    { "token": "i", "start_offset": 28, "end_offset": 29, "type": "<ALPHANUM>", "position": 5 },
    { "token": "disciplina", "start_offset": 30, "end_offset": 40, "type": "<ALPHANUM>", "position": 6 }
  ]
}
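As you can see, the first four words were transliterated to Latin script without being lemmatized, while the last three words, which were in Latin script to begin with, were lemmatized. How can I fix this?
Answer 0 (score: 0)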
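After some experimenting I found a workaround. Instead of using two token filters in the analyzer, I moved the transliteration out of latin_transform and into a char_filter, so the text is first transliterated through a mapping and only then lemmatized.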
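A minimal sketch of what such an analyzer could look like, not the exact definition used here. Char filters run before the tokenizer, so the lemmagen filter only ever sees Latin-script tokens. The char filter name serbian_transliteration, the added lowercase token filter (standing in for the Lower() step of the old transform), and the assumption that serbian_mapping.txt sits directly in the Elasticsearch config directory are all illustrative choices:

```
# Sketch only: the char filter name, the "lowercase" step and the file location are assumptions.
PUT /icu
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "serbian_transliteration": {
            "type": "mapping",
            "mappings_path": "serbian_mapping.txt"
          }
        },
        "filter": {
          "lemmagen_filter_sr": {
            "type": "lemmagen",
            "lexicon": "sr"
          }
        },
        "analyzer": {
          "lemmagen_lowercase_sr": {
            "type": "custom",
            "char_filter": ["serbian_transliteration"],
            "tokenizer": "standard",
            "filter": ["lowercase", "lemmagen_filter_sr"]
          }
        }
      }
    }
  }
}
```

Since the index /icu already exists with the old analyzer, it would have to be deleted and recreated (or created under a new name) for these settings to apply.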
serbian_mapping.txt contains the transliteration key pairs, and that solves the problem.
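For illustration, a mapping file for the mapping char filter holds one `key => value` rule per line. The rules below are an assumed sketch covering the lowercase Serbian Cyrillic alphabet, not the actual file from this answer:

```
# Sketch only: assumed contents of serbian_mapping.txt (lowercase letters only)
а => a
б => b
в => v
г => g
д => d
ђ => đ
е => e
ж => ž
з => z
и => i
ј => j
к => k
л => l
љ => lj
м => m
н => n
њ => nj
о => o
п => p
р => r
с => s
т => t
ћ => ć
у => u
ф => f
х => h
ц => c
ч => č
џ => dž
ш => š
```

Char filters see the raw text before any lowercasing happens, so uppercase Cyrillic letters would need their own rules (for example `Љ => lj`) if the input is not already lowercase.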