Elasticsearch tokenization for international languages

Time: 2014-11-29 17:26:03

Tags: elasticsearch tokenize

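I want to understand how Elasticsearch tokenizes languages other than English, so I tried the analyze API it provides. But I can't make sense of the output at all. For example: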

GET myindex/_analyze?analyzer=hindi&text="में कहता हूँ और तुम सुनना "

The text above contains 6 words, so I expected at most 6 tokens (assuming the text contains no stop words), but the output looks like this:

 {
   "tokens": [
      {
         "token": "2350",
         "start_offset": 3,
         "end_offset": 7,
         "type": "<NUM>",
         "position": 1
      },
      {
         "token": "2375",
         "start_offset": 10,
         "end_offset": 14,
         "type": "<NUM>",
         "position": 2
      },
      {
         "token": "2306",
         "start_offset": 17,
         "end_offset": 21,
         "type": "<NUM>",
         "position": 3
      },
      {
         "token": "2325",
         "start_offset": 25,
         "end_offset": 29,
         "type": "<NUM>",
         "position": 4
      },
      {
         "token": "2361",
         "start_offset": 32,
         "end_offset": 36,
         "type": "<NUM>",
         "position": 5
      },
      {
         "token": "2340",
         "start_offset": 39,
         "end_offset": 43,
         "type": "<NUM>",
         "position": 6
      },
      {
         "token": "2366",
         "start_offset": 46,
         "end_offset": 50,
         "type": "<NUM>",
         "position": 7
      },
      {
         "token": "2361",
         "start_offset": 54,
         "end_offset": 58,
         "type": "<NUM>",
         "position": 8
      },
      {
         "token": "2370",
         "start_offset": 61,
         "end_offset": 65,
         "type": "<NUM>",
         "position": 9
      },
      {
         "token": "2305",
         "start_offset": 68,
         "end_offset": 72,
         "type": "<NUM>",
         "position": 10
      },
      {
         "token": "2324",
         "start_offset": 76,
         "end_offset": 80,
         "type": "<NUM>",
         "position": 11
      },
      {
         "token": "2352",
         "start_offset": 83,
         "end_offset": 87,
         "type": "<NUM>",
         "position": 12
      },
      {
         "token": "2340",
         "start_offset": 91,
         "end_offset": 95,
         "type": "<NUM>",
         "position": 13
      },
      {
         "token": "2369",
         "start_offset": 98,
         "end_offset": 102,
         "type": "<NUM>",
         "position": 14
      },
      {
         "token": "2350",
         "start_offset": 105,
         "end_offset": 109,
         "type": "<NUM>",
         "position": 15
      },
      {
         "token": "2360",
         "start_offset": 113,
         "end_offset": 117,
         "type": "<NUM>",
         "position": 16
      },
      {
         "token": "2369",
         "start_offset": 120,
         "end_offset": 124,
         "type": "<NUM>",
         "position": 17
      },
      {
         "token": "2344",
         "start_offset": 127,
         "end_offset": 131,
         "type": "<NUM>",
         "position": 18
      },
      {
         "token": "2344",
         "start_offset": 134,
         "end_offset": 138,
         "type": "<NUM>",
         "position": 19
      },
      {
         "token": "2366",
         "start_offset": 141,
         "end_offset": 145,
         "type": "<NUM>",
         "position": 20
      }
   ]
}

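So instead of six tokens, Elasticsearch produced 20, all of type <NUM> (I don't know what that is). I'm really confused about why this happens. Can someone tell me what is going on? Am I doing something wrong, or am I misunderstanding something?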

1 Answer:

Answer 0: (score: 1)

How are you calling the Elasticsearch API? Could your client be mangling the Hindi characters?
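One way to check this suspicion is to read each numeric token from the question as a decimal Unicode code point. The sketch below does that with the token values copied from the question's output. If the result is Devanagari text, the client most likely sent the sentence as HTML numeric character references (for example `&#2350;`) rather than as raw UTF-8. That is an inference, not something the question confirms, although the offsets (3-7, 10-14, ...) also fit 7-character entities of that form.

```python
import html

# Token strings copied from the _analyze output in the question.
tokens = ["2350", "2375", "2306", "2325", "2361", "2340", "2366",
          "2361", "2370", "2305", "2324", "2352", "2340", "2369",
          "2350", "2360", "2369", "2344", "2344", "2366"]

# Treat each token as a decimal Unicode code point.
decoded = "".join(chr(int(t)) for t in tokens)

# Check that every decoded character lies in the Devanagari block (U+0900-U+097F).
all_devanagari = all(0x0900 <= ord(ch) <= 0x097F for ch in decoded)

# Assumption: the client escaped characters as HTML numeric references.
# html.unescape turns such a reference back into the character it names.
first_char = html.unescape("&#2350;")

print(decoded)
print(all_devanagari)
print(first_char == decoded[0])
```

Under that assumption, the analyzer only ever saw ASCII digits separated by `&`, `#` and `;`, which would explain both the `<NUM>` type and the count of 20 tokens (one per character of the sentence, spaces excluded).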

For me this works fine with curl (at least the Hindi characters show up in the result):

curl -XPOST 'http://localhost:9200/myindex/_analyze?analyzer=hindi&pretty' -d 'में कहता हूँ और तुम सुनना '
{
  "tokens" : [ {
    "token" : "कह",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "हुं",
    "start_offset" : 9,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "तुम",
    "start_offset" : 16,
    "end_offset" : 19,
    "type" : "<ALPHANUM>",
    "position" : 5
  }, {
    "token" : "सुन",
    "start_offset" : 20,
    "end_offset" : 25,
    "type" : "<ALPHANUM>",
    "position" : 6
  } ]
}
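For clients other than curl, the same request can be sketched in Python so that the text is explicitly encoded as UTF-8 bytes before it is sent. This is a minimal sketch, not a definitive client: the host, port, index name and query-string `analyzer` parameter are taken from the curl call above, the helper names `build_analyze_request` and `analyze` are hypothetical, and newer Elasticsearch versions may expect a JSON body instead of raw text.

```python
import urllib.parse
import urllib.request

# Assumption: same node and index as in the curl example above.
BASE_URL = "http://localhost:9200/myindex/_analyze"


def build_analyze_request(text, analyzer="hindi"):
    """Build a POST request whose body is the raw text encoded as UTF-8."""
    query = urllib.parse.urlencode({"analyzer": analyzer, "pretty": "true"})
    body = text.encode("utf-8")  # explicit bytes, no HTML entities
    return urllib.request.Request(f"{BASE_URL}?{query}", data=body, method="POST")


def analyze(text, analyzer="hindi"):
    """Send the request and return the response text (needs a running node)."""
    with urllib.request.urlopen(build_analyze_request(text, analyzer)) as resp:
        return resp.read().decode("utf-8")


# Building the request does not touch the network, so it can be inspected first.
request = build_analyze_request("में कहता हूँ और तुम सुनना ")
print(request.full_url)
print(request.data)
```

Printing `request.data` before sending makes it easy to confirm that the body holds UTF-8 bytes rather than escaped sequences such as `&#2350;`.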