Question

I'm trying to use Elasticsearch to index some data about research papers, but I'm having trouble with accents. For instance, if I use:

GET /_analyze?tokenizer=standard&filter=asciifolding&text="Boletínes de investigaciónes"

I get:

{
   "tokens": [
      {
         "token": "Bolet",
         "start_offset": 1,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "nes",
         "start_offset": 7,
         "end_offset": 10,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "de",
         "start_offset": 11,
         "end_offset": 13,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "investigaci",
         "start_offset": 14,
         "end_offset": 25,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "nes",
         "start_offset": 26,
         "end_offset": 29,
         "type": "<ALPHANUM>",
         "position": 5
      }
   ]
}

But I should get something like this:

{
   "tokens": [
      {
         "token": "Boletines",
         "start_offset": 1,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "de",
         "start_offset": 11,
         "end_offset": 13,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "investigacion",
         "start_offset": 14,
         "end_offset": 25,
         "type": "<ALPHANUM>",
         "position": 4
      }
   ]
}

What should I do?

Answer 1

To prevent the extra tokens from being formed, you need to use a different tokenizer; for example, try the whitespace tokenizer.
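As a minimal sketch of the whitespace-tokenizer suggestion, the Python snippet below builds a JSON body that could be sent to the `_analyze` endpoint instead of passing the text in the query string. The helper names `build_analyze_body` and `fold_locally` are made up for illustration, and the body keys (`tokenizer`, `filter`, `text`) assume a reasonably recent Elasticsearch version; older releases may expect different parameter names. `fold_locally` is only a rough local stand-in for whitespace splitting plus accent folding, not Elasticsearch's actual `asciifolding` implementation.

```python
import json
import unicodedata


def build_analyze_body(text, tokenizer="whitespace", filters=("asciifolding",)):
    """Build a JSON body for the _analyze API (key names assume a recent Elasticsearch)."""
    return {"tokenizer": tokenizer, "filter": list(filters), "text": text}


def fold_locally(text):
    """Rough local stand-in: split on whitespace, then drop combining accent marks."""
    tokens = []
    for word in text.split():
        # NFKD separates a base letter from its accent, e.g. "í" -> "i" + U+0301.
        decomposed = unicodedata.normalize("NFKD", word)
        tokens.append("".join(ch for ch in decomposed if not unicodedata.combining(ch)))
    return tokens


text = "Boletínes de investigaciónes"
analyze_body = build_analyze_body(text)

# Body that could be sent with: GET /_analyze
print(json.dumps(analyze_body, ensure_ascii=False))
# Approximation of the tokens the request should produce.
print(fold_locally(text))
```

Sending the text in a JSON body may also sidestep URL-encoding problems with accented characters, which is one possible (unconfirmed) reason the original request split the words at `í` and `ó`.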

Alternatively, use a language analyzer and specify the language.
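The language-analyzer option could look roughly like the sketch below, which builds an index-creation body that assigns the built-in `spanish` analyzer to a text field. The index name `papers`, the field name `title`, and the helper `build_index_body` are hypothetical, and the typeless `mappings` layout assumes a recent Elasticsearch version. Note that a language analyzer also lowercases, removes stop words, and stems, so its tokens will not match the desired output in the question exactly (for example, "de" is a Spanish stop word and would be dropped).

```python
import json

INDEX_NAME = "papers"  # hypothetical index name
FIELD_NAME = "title"   # hypothetical field holding the paper title


def build_index_body(field=FIELD_NAME, language="spanish"):
    """Build index settings that analyze `field` with a built-in language analyzer."""
    return {
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": language}
            }
        }
    }


index_body = build_index_body()

# Request that could be issued to create the index.
print(f"PUT /{INDEX_NAME}")
print(json.dumps(index_body, indent=2))
```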
