The requirement is to create a custom analyzer that generates two tokens, as shown below.
For example:
Input -> B.tech in
Output Tokens ->
- btechin
- b.tech in
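I am able to remove the non-alphanumeric characters, but how do I also keep the original text in the list of output tokens? Below is the custom analyzer I created.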
"alphanumericStringAnalyzer": {
    "filter": [
        "lowercase",
        "minLength_filter"
    ],
    "char_filter": [
        "specialCharactersFilter"
    ],
    "type": "custom",
    "tokenizer": "keyword"
}

"char_filter": {
    "specialCharactersFilter": {
        "pattern": "[^A-Za-z0-9]",
        "type": "pattern_replace",
        "replacement": ""
    }
},
This analyzer generates a single token, "btechin", for the input "B.tech in", but I also want the original token, "B.tech in", in the token list.
Thanks!
Answer 0 (score: 4)
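You can use the word delimiter token filter (`word_delimiter`), as described in the documentation. With `catenate_all` the alphanumeric parts are joined into one token, `preserve_original` keeps the unsplit token alongside it, and `generate_word_parts: false` suppresses the individual parts ("b", "tech", "in").
Below is an example word delimiter configuration: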
POST _analyze
{
    "text": "B.tech in",
    "tokenizer": "keyword",
    "filter": [
        "lowercase",
        {
            "type": "word_delimiter",
            "catenate_all": true,
            "preserve_original": true,
            "generate_word_parts": false
        }
    ]
}
Result:
{
    "tokens": [
        {
            "token": "b.tech in",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "btechin",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        }
    ]
}
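To get the same tokens from a named analyzer rather than an ad hoc `_analyze` call, the filter would presumably need to be registered in the index settings. The following is only a sketch of what that `analysis` section might look like. It reuses the analyzer name `alphanumericStringAnalyzer` from the question; the filter name `my_word_delimiter` is a placeholder invented for illustration, and the question's `minLength_filter` is left out because its definition is not shown.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter",
          "catenate_all": true,
          "preserve_original": true,
          "generate_word_parts": false
        }
      },
      "analyzer": {
        "alphanumericStringAnalyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "my_word_delimiter"
          ]
        }
      }
    }
  }
}
```

This would go in the body of the index-creation request. The `specialCharactersFilter` char filter from the question is deliberately absent: char filters run before the tokenizer, so if it stayed in place the punctuation and spaces would already be stripped by the time the token filter runs, and there would be no original form left to preserve.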
Hope this meets your requirement!