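I have titles like SimpleDoc000155 / 1 (the number of characters is not fixed, but they are always followed by 9 digits and/or a digit), and I would like to know how to analyze them so that I get these tokens: 155 and SimpleDoc000155.
The Elasticsearch version is 2.2.
My current settings are: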
  "analysis": {
    "analyzer": {
      "my_analyzer": {
        "tokenizer": "autocomplete",
        "filter": [ "code", "lowercase" ]
      }
    },
    "filter": {
      "code": {
        "type": "pattern_capture",
        "preserve_original": 1,
        "patterns": ["([1-9].+(?=\/))"]
      }
    },
    "tokenizer": {
      "autocomplete": {
        "type": "edge_ngram",
        "min_gram": 6,
        "max_gram": 32,
        "token_chars": [
          "letter",
          "digit"
        ]
      }
    }
  }
}
The result I get is:
{
  "tokens": [
    {
      "token": "simple",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "simpled",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "simpledo",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "simpledoc",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 3
    },
    {
      "token": "simpledoc0",
      "start_offset": 0,
      "end_offset": 10,
      "type": "word",
      "position": 4
    },
    {
      "token": "simpledoc00",
      "start_offset": 0,
      "end_offset": 11,
      "type": "word",
      "position": 5
    },
    {
      "token": "simpledoc000",
      "start_offset": 0,
      "end_offset": 12,
      "type": "word",
      "position": 6
    },
    {
      "token": "simpledoc0001",
      "start_offset": 0,
      "end_offset": 13,
      "type": "word",
      "position": 7
    },
    {
      "token": "simpledoc00015",
      "start_offset": 0,
      "end_offset": 14,
      "type": "word",
      "position": 8
    },
    {
      "token": "simpledoc000155",
      "start_offset": 0,
      "end_offset": 15,
      "type": "word",
      "position": 9
    }
  ]
}
I'm a bit lost. I have tried a lot of things, but I can't get 155 back as a token; it looks like pattern_capture is not working.
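One possible explanation, sketched here under assumptions: the tokenizer keeps only "letter" and "digit" characters, so no token that reaches the "code" filter can contain a "/", and the lookahead `(?=\/)` in the pattern can then never succeed. The snippet below uses Python's `re` module as a stand-in for the Java regex engine that Elasticsearch uses (the two agree on this simple pattern, but that is an assumption, not something tested against Elasticsearch 2.2). `HYPOTHETICAL_PATTERN` is an illustrative alternative, not a verified fix.

```python
import re

# Pattern from the "code" filter above; the JSON escape "\/" is just "/".
ORIGINAL_PATTERN = re.compile(r"([1-9].+(?=/))")

# Hypothetical alternative (an assumption for illustration): skip leading
# zeros and capture the digits at the very end of the token.
HYPOTHETICAL_PATTERN = re.compile(r"0*([1-9][0-9]*)$")


def capture(pattern, text):
    """Return the first capture group, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


# What the filter sees: a token made of letters and digits only.
print(capture(ORIGINAL_PATTERN, "SimpleDoc000155"))            # None

# The raw title would match, but the filter never receives it as one token.
print(repr(capture(ORIGINAL_PATTERN, "SimpleDoc000155 / 1")))  # '155 '

# A pattern anchored on the token's own trailing digits does find 155.
print(capture(HYPOTHETICAL_PATTERN, "SimpleDoc000155"))        # 155
```

Note that with an edge n-gram tokenizer the hypothetical pattern would also capture from the shorter grams that end in a non-zero digit (for example 1 from SimpleDoc0001), so it would still need testing with the `_analyze` API.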
Thanks for your answers!
Update
Changing the tokenizer from edge_ngram to ngram kind of works, but it produces a lot of unwanted tokens.
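To give a rough idea of how many extra tokens the ngram tokenizer produces, here is a pure-Python approximation of both tokenizers for a single word. The helper names `edge_ngrams` and `ngrams` are made up for this sketch and only imitate the gram generation, assuming the same `min_gram` of 6 and `max_gram` of 32 as in the settings above; they are not Elasticsearch code.

```python
def edge_ngrams(word, min_gram, max_gram):
    """Prefixes of `word` with lengths from min_gram up to max_gram."""
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]


def ngrams(word, min_gram, max_gram):
    """All substrings of `word` with lengths from min_gram up to max_gram."""
    longest = min(max_gram, len(word))
    return [
        word[start:start + n]
        for start in range(len(word))
        for n in range(min_gram, longest + 1)
        if start + n <= len(word)
    ]


word = "simpledoc000155"
print(len(edge_ngrams(word, 6, 32)))    # 10, the tokens listed above
print(len(ngrams(word, 6, 32)))         # 55
print("000155" in ngrams(word, 6, 32))  # True
print("155" in ngrams(word, 6, 32))     # False, shorter than min_gram
```

Under these assumptions the ngram tokenizer emits 55 grams for this one word instead of 10, and the closest token to the wanted number is 000155 rather than 155.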