I am using the pattern_capture filter to preserve all the acronyms:
PUT test_index/_settings
{
  "index.analysis.filter": {
    "acronym_en_EN": {
      "type": "pattern_capture",
      "patterns": [
        "(?:[a-zA-Z]\\.)+",
        "((?:[a-zA-Z]\\.)+[a-zA-Z])",
        "((?:[a-zA-Z]\\.)+[s]$)",
        "((?:[a-zA-Z]\\.)+[s][\\.]$)"
      ],
      "preserve_original": true
    }
  }
}
But I noticed that acronyms ending in s or s. are stemmed, because there is also a stemmer filter attached to the analyzer. The regular expressions in the filter above that are meant to handle the trailing s are not working either.
I test the output using this request:
GET test_index/_analyze?tokenizer=standard&filters=lowercase,acronym_en_EN,apostrophe,porter_stemmer_en_EN&text=u.s.a. u.s. s.w.a.t u.t.
This gives me:
{
  "tokens": [
    {
      "token": "u.s.a",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "u.",
      "start_offset": 7,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "u.",
      "start_offset": 7,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "s.w.a.t",
      "start_offset": 12,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "u.t",
      "start_offset": 20,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}
Is there any way I can preserve the acronyms ending with s, so that for u.s. or u.s I don't get u.?
Answer 0 (score: 1)
I don't think this is possible out of the box. I believe the way to do it would be to teach the pattern_capture filter how to mark its captures as keyword tokens, the way the keyword_marker filter does.
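To see why the patterns alone do not help, the question's regexes can be run against the tokens the filter actually receives. This is a rough check in Python, assuming Python's re module treats these simple patterns the same way Java's regex engine does. The token list is read off the offsets in the question's output (7 to 10 is the three characters u.s), and the captures helper is a hypothetical name used only for illustration.

```python
import re

# The four patterns from the question's acronym_en_EN filter (JSON escaping removed).
PATTERNS = [
    r"(?:[a-zA-Z]\.)+",
    r"((?:[a-zA-Z]\.)+[a-zA-Z])",
    r"((?:[a-zA-Z]\.)+[s]$)",
    r"((?:[a-zA-Z]\.)+[s][\.]$)",
]

# Tokens as they reach the filter, inferred from the offsets in the question's
# output: the trailing period is already gone before any filter runs.
TOKENS = ["u.s.a", "u.s", "s.w.a.t", "u.t"]


def captures(token):
    """Return the group-1 capture of every pattern match in the token."""
    found = []
    for pattern in PATTERNS:
        for match in re.finditer(pattern, token):
            if match.groups():  # the first pattern has no capture group
                found.append(match.group(1))
    return found


for token in TOKENS:
    print(token, captures(token))
```

In this check every capture is identical to the whole token, and the pattern that expects a trailing s. never matches because no token ends in a period. So the stemmer further down the chain still receives u.s and is free to stem it.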
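Honestly, you could probably hack something together with two pattern_replace token filters, one on either side of the stemmer. Just slap some marker on the front of the acronyms and then rip it off on the other side.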
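The two regex steps of that hack can be sketched in Python as below. This is only an illustration of the mark and unmark replacements, not Elasticsearch configuration: the marker string "__", the acronym pattern, and the function names are all assumptions, since the answer does not specify them. Whether the stemmer between the two steps actually leaves a marked token alone is not verified here and would need to be checked with _analyze.

```python
import re

MARKER = "__"  # hypothetical marker; the answer does not name one

# Assumed acronym shape: one or more "letter." groups, optionally ending in a letter.
ACRONYM = re.compile(r"^((?:[a-zA-Z]\.)+[a-zA-Z]?)$")


def mark(token):
    """Step before the stemmer: prefix tokens that look like acronyms."""
    return ACRONYM.sub(MARKER + r"\1", token)


def unmark(token):
    """Step after the stemmer: strip the marker from the front again."""
    return re.sub("^" + re.escape(MARKER), "", token)


for token in ["u.s.a", "u.s", "s.w.a.t", "running"]:
    marked = mark(token)
    print(token, "->", marked, "->", unmark(marked))
```

Ordinary words such as running pass through both steps unchanged, so only tokens matching the assumed acronym shape carry the marker through the stemmer.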