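I want to understand how Elasticsearch tokenizes languages other than English, so I tried the analyze API it provides. But I can't make sense of the output at all. For example: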
GET myindex/_analyze?analyzer=hindi&text="में कहता हूँ और तुम सुनना "
The text above contains 6 words, so I expected at most 6 tokens (assuming the text contains no stop words), but the output looks like this:
{
"tokens": [
{
"token": "2350",
"start_offset": 3,
"end_offset": 7,
"type": "<NUM>",
"position": 1
},
{
"token": "2375",
"start_offset": 10,
"end_offset": 14,
"type": "<NUM>",
"position": 2
},
{
"token": "2306",
"start_offset": 17,
"end_offset": 21,
"type": "<NUM>",
"position": 3
},
{
"token": "2325",
"start_offset": 25,
"end_offset": 29,
"type": "<NUM>",
"position": 4
},
{
"token": "2361",
"start_offset": 32,
"end_offset": 36,
"type": "<NUM>",
"position": 5
},
{
"token": "2340",
"start_offset": 39,
"end_offset": 43,
"type": "<NUM>",
"position": 6
},
{
"token": "2366",
"start_offset": 46,
"end_offset": 50,
"type": "<NUM>",
"position": 7
},
{
"token": "2361",
"start_offset": 54,
"end_offset": 58,
"type": "<NUM>",
"position": 8
},
{
"token": "2370",
"start_offset": 61,
"end_offset": 65,
"type": "<NUM>",
"position": 9
},
{
"token": "2305",
"start_offset": 68,
"end_offset": 72,
"type": "<NUM>",
"position": 10
},
{
"token": "2324",
"start_offset": 76,
"end_offset": 80,
"type": "<NUM>",
"position": 11
},
{
"token": "2352",
"start_offset": 83,
"end_offset": 87,
"type": "<NUM>",
"position": 12
},
{
"token": "2340",
"start_offset": 91,
"end_offset": 95,
"type": "<NUM>",
"position": 13
},
{
"token": "2369",
"start_offset": 98,
"end_offset": 102,
"type": "<NUM>",
"position": 14
},
{
"token": "2350",
"start_offset": 105,
"end_offset": 109,
"type": "<NUM>",
"position": 15
},
{
"token": "2360",
"start_offset": 113,
"end_offset": 117,
"type": "<NUM>",
"position": 16
},
{
"token": "2369",
"start_offset": 120,
"end_offset": 124,
"type": "<NUM>",
"position": 17
},
{
"token": "2344",
"start_offset": 127,
"end_offset": 131,
"type": "<NUM>",
"position": 18
},
{
"token": "2344",
"start_offset": 134,
"end_offset": 138,
"type": "<NUM>",
"position": 19
},
{
"token": "2366",
"start_offset": 141,
"end_offset": 145,
"type": "<NUM>",
"position": 20
}
]
}
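So instead of six tokens, Elasticsearch produced 20, all of type <NUM> (I don't know what that means). I'm really confused about why this happens. Can someone tell me what is going on? Am I doing something wrong, or is there something I don't understand?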
Answer 0 (score: 1)
How are you calling the Elasticsearch API? Could the Hindi characters be getting mangled by your client?
For me it works fine with curl (at least the Hindi characters show up in the result):
curl -XPOST 'http://localhost:9200/myindex/_analyze?analyzer=hindi&pretty' -d 'में कहता हूँ और तुम सुनना '
{
"tokens" : [ {
"token" : "कह",
"start_offset" : 4,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "हुं",
"start_offset" : 9,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "तुम",
"start_offset" : 16,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 5
}, {
"token" : "सुन",
"start_offset" : 20,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 6
} ]
}
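The numeric tokens in the question are consistent with the client-mangling suspicion above: each value is the decimal Unicode code point of one Devanagari character (2350 is U+092E, म), and the offsets match what you would get if every character had been sent as an HTML numeric reference such as `&#2350;`. That encoding is an inference from the output, not something the question confirms. A minimal Python sketch that rebuilds such a string from the question's token values and checks it against the reported offsets:

```python
import html
import re

# Token values copied from the question's _analyze output, grouped by word.
words = [
    [2350, 2375, 2306],
    [2325, 2361, 2340, 2366],
    [2361, 2370, 2305],
    [2324, 2352],
    [2340, 2369, 2350],
    [2360, 2369, 2344, 2344, 2366],
]

# Assumption: the client replaced each character with "&#<code point>;"
# and kept the surrounding double quotes and spaces as typed.
encoded = '"' + " ".join("".join(f"&#{cp};" for cp in w) for w in words) + ' "'

# If the tokenizer drops "&", "#" and ";", only the digit runs remain.
digit_runs = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", encoded)]

# Decoding the references recovers the Devanagari text.
decoded = html.unescape(encoded)

print(len(digit_runs))                # 20
print(digit_runs[0], digit_runs[-1])  # ('2350', 3, 7) ('2366', 141, 145)
print(decoded)
```

The count and the first and last offsets agree with the output in the question, which suggests the analyzer never saw Hindi text at all, only digits separated by punctuation. Sending the raw UTF-8 text in the request body, as in the curl call above, avoids the problem.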