Avoid stemming of Acronyms?

Time: 2015-09-01 22:08:07

Tags: elasticsearch dsl text-mining

I am using the pattern_capture filter to preserve all the acronyms:

PUT test_index/_settings
{
  "index.analysis.filter": {
    "acronym_en_EN": {
      "type": "pattern_capture",
      "patterns": [
        "(?:[a-zA-Z]\\.)+", 
        "((?:[a-zA-Z]\\.)+[a-zA-Z])",
        "((?:[a-zA-Z]\\.)+[s]$)",
        "((?:[a-zA-Z]\\.)+[s][\\.]$)"
        ],
      "preserve_original": true
    }
  }
}

But I noticed that acronyms ending with s or s. are still stemmed, because a stemmer filter is also attached to the analyzer. The regular expressions in the filter above that are meant to handle the trailing s are not working either.

I test the output using this:

GET test_index/_analyze?tokenizer=standard&filters=lowercase,acronym_en_EN,apostrophe,porter_stemmer_en_EN&text=u.s.a. u.s. s.w.a.t u.t. 

This gives me:

{
   "tokens": [
      {
         "token": "u.s.a",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "u.",
         "start_offset": 7,
         "end_offset": 10,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "u.",
         "start_offset": 7,
         "end_offset": 10,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "s.w.a.t",
         "start_offset": 12,
         "end_offset": 19,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "u.t",
         "start_offset": 20,
         "end_offset": 23,
         "type": "<ALPHANUM>",
         "position": 4
      }
   ]
}

Is there any way I can preserve the acronyms ending with s so that for u.s. or u.s I don't get u.?

1 answer:

Answer 0 (score: 1)

I don't think this is possible out of the box. I believe the way to do it would be to teach the pattern_capture filter how to mark its captures as keywords, the way the keyword_marker filter does.

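Honestly, you might be able to hack something together with two pattern_replace token filters, one on either side of the stemmer. Just slap a marker character or something on the acronym before the stemmer, then rip it off on the other side.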
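
One way to read the hack above is as a pair of pattern_replace filters wrapped around the stemmer. What follows is an untested sketch, not a verified fix. The filter names acronym_mark and acronym_unmark and the "_" marker are made up for illustration. The sketch also puts the marker at the end of the token rather than the front, on the assumption that a suffix-stripping stemmer such as Porter only leaves a token alone when its final characters are not a suffix it recognizes. The patterns assume the lowercase filter has already run and that the standard tokenizer has dropped the trailing dot, as in the output shown in the question.

```
# Hypothetical filter names and marker character; adjust to your own analyzer.
# acronym_mark appends "_" to tokens that look like dotted acronyms (u.s, u.s.a).
# acronym_unmark strips that "_" again once the stemmer has run.
PUT test_index/_settings
{
  "index.analysis.filter": {
    "acronym_mark": {
      "type": "pattern_replace",
      "pattern": "^((?:[a-z]\\.)+[a-z]?)$",
      "replacement": "$1_"
    },
    "acronym_unmark": {
      "type": "pattern_replace",
      "pattern": "^((?:[a-z]\\.)+[a-z]?)_$",
      "replacement": "$1"
    }
  }
}

# Same check as in the question, with the stemmer sandwiched between the two filters.
# porter_stemmer_en_EN is the question's own stemmer filter, assumed to be defined already.
GET test_index/_analyze?tokenizer=standard&filters=lowercase,acronym_mark,porter_stemmer_en_EN,acronym_unmark&text=u.s.a. u.s. s.w.a.t u.t.
```

If this behaves as intended, u.s should pass through the stemmer as u.s_ and come out as u.s again. Tokens that legitimately end in "_" would also be affected by the second filter only if they match the acronym shape, which is why both patterns are anchored to the whole token.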