Custom tokenizer does not generate tokens as expected when the text contains special characters such as # and @

Time: 2015-07-15 09:52:09

Tags: elasticsearch token tokenize analyzer

I have defined the following tokenizer:

PUT /testanlyzer2
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_ngram_analyzer" : {
                    "tokenizer" : "my_ngram_tokenizer"
                }
            },
            "tokenizer" : {
                "my_ngram_tokenizer" : {
                    "type" : "nGram",
                    "min_gram" : "1",
                    "max_gram" : "3",
                    "token_chars": [ "letter", "digit", "symbol", "currency_symbol", "modifier_symbol", "other_symbol" ]
                }
            }
        }
    }
}
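
A side note on the token_chars list: in Unicode, both # and @ fall under the punctuation categories rather than the symbol ones, so the symbol classes alone will never keep them. As far as I can tell, the classes documented for the nGram tokenizer are letter, digit, whitespace, punctuation and symbol. A minimal sketch of the same settings with punctuation added (the index name testanlyzer2_punct is just a placeholder):

PUT /testanlyzer2_punct
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_ngram_analyzer" : {
                    "tokenizer" : "my_ngram_tokenizer"
                }
            },
            "tokenizer" : {
                "my_ngram_tokenizer" : {
                    "type" : "nGram",
                    "min_gram" : "1",
                    "max_gram" : "3",
                    "token_chars": [ "letter", "digit", "punctuation", "symbol" ]
                }
            }
        }
    }
}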

For the following request:
  GET /testanlyzer2/_analyze?analyzer=my_ngram_analyzer&text="i a#m not available 9177"

The result is:

{
   "tokens": [
      {
         "token": "i",
         "start_offset": 1,
         "end_offset": 2,
         "type": "word",
         "position": 1
      },
      {
         "token": "a",
         "start_offset": 3,
         "end_offset": 4,
         "type": "word",
         "position": 2
      }
   ]
}
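
Part of this appears to be URL syntax rather than the analyzer: in a URL, # begins the fragment, so the client drops everything from # onward and Elasticsearch only ever sees text="i a — which matches the two tokens above (note the start_offset of 1, caused by the leading quote). Percent-encoding the text (%23 for #, %20 for spaces) keeps the whole string, for example:

GET /testanlyzer2/_analyze?analyzer=my_ngram_analyzer&text=i%20a%23m%20not%20available%209177

Alternatively, on the 1.x API the text to analyze can be sent as the request body, which sidesteps URL escaping entirely (a sketch, assuming a node on localhost:9200):

curl -XGET 'localhost:9200/testanlyzer2/_analyze?analyzer=my_ngram_analyzer' -d 'i a#m not available 9177'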

For the following request:

GET /testanlyzer2/_analyze?analyzer=my_ngram_analyzer&text="i a@m not available 9177"

The result is:

Request failed to get to the server (status code: 0):
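
Status code 0 generally means the request never reached the server at all, i.e. it failed on the client side, so this is more likely the unescaped @ in the URL than the tokenizer itself. Percent-encoding it as %40 should at least get the request through (a sketch, same caveats as above):

GET /testanlyzer2/_analyze?analyzer=my_ngram_analyzer&text=i%20a%40m%20not%20available%209177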

The expected result should contain these special characters (@, #, currency symbols, etc.) as tokens. Please correct me if anything is wrong in my custom tokenizer.

- Thanks

0 answers:

No answers