I'm trying to implement an Elasticsearch pattern_capture filter that turns EDR-00004 into the tokens [EDR-00004, 00004, 4]. I'm (still) on Elasticsearch 2.4, but the documentation doesn't differ from current ES versions.
I followed the example in the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/2.4/analysis-pattern-capture-tokenfilter.html
Here are my tests and their results:
curl -XPUT 'localhost:9200/test_index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "process_number_filter": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": [
            "([A-Za-z]+-([0]+([0-9]+)))"
          ]
        }
      },
      "analyzer": {
        "process_number_analyzer": {
          "type": "custom",
          "tokenizer": "pattern",
          "filter": ["process_number_filter"]
        }
      }
    }
  }
}'
curl -XGET 'localhost:9200/test_index/_analyze' -d '
{
  "analyzer": "process_number_analyzer",
  "text": "EDR-00002"
}'
curl -XGET 'localhost:9200/test_index/_analyze' -d '
{
  "analyzer": "standard",
  "tokenizer": "standard",
  "filter": ["process_number_filter"],
  "text": "EDR-00002"
}'
These return:
{"acknowledged":true}
{
  "tokens": [{
    "token": "EDR",
    "start_offset": 0,
    "end_offset": 3,
    "type": "word",
    "position": 0
  }, {
    "token": "00002",
    "start_offset": 4,
    "end_offset": 9,
    "type": "word",
    "position": 1
  }]
}
{
  "tokens": [{
    "token": "edr",
    "start_offset": 0,
    "end_offset": 3,
    "type": "<ALPHANUM>",
    "position": 0
  }, {
    "token": "00002",
    "start_offset": 4,
    "end_offset": 9,
    "type": "<NUM>",
    "position": 1
  }]
}
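Both _analyze calls emit only "EDR" and "00002" because the tokenizer splits the input on the hyphen before the filter ever runs; the pattern tokenizer's default split pattern is \W+. Python's re module (which the question already uses below to check the regex) illustrates the effect:

```python
import re

# The pattern tokenizer splits on \W+ by default; the hyphen is a
# non-word character, so "EDR-00002" is broken apart before the
# pattern_capture filter sees it.
tokens = [t for t in re.split(r"\W+", "EDR-00002") if t]
print(tokens)  # ['EDR', '00002']
```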
I also made sure my regular expression is correct:
>>> import re
>>> m = re.match(r"([A-Za-z]+-([0]+([0-9]+)))", "EDR-00004")
>>> m.groups()
('EDR-00004', '00004', '4')
Answer (score: 0)
I hate answering my own question, but I found the answer, and maybe it will help someone in the future.
My problem was the default tokenizer, which split the text before it ever reached my filter. By adding my own tokenizer, overriding the default split pattern "\W+" with "[^\w-]+", my filter receives the whole word and produces the correct tokens.
Here are my custom settings:
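The change in split pattern can be checked with Python's re module: excluding the hyphen from the split class keeps the hyphenated identifier in one piece.

```python
import re

# [^\w-]+ splits on runs of characters that are neither word characters
# nor hyphens, so "EDR-00002" survives tokenization intact.
tokens = [t for t in re.split(r"[^\w-]+", "text: EDR-00002!") if t]
print(tokens)  # ['text', 'EDR-00002']
```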
curl -XPUT 'localhost:9200/test_index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "process_number_filter": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": [
            "([A-Za-z]+-([0]+([0-9]+)))"
          ]
        }
      },
      "tokenizer": {
        "process_number_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\w-]+"
        }
      },
      "analyzer": {
        "process_number_analyzer": {
          "type": "custom",
          "tokenizer": "process_number_tokenizer",
          "filter": ["process_number_filter"]
        }
      }
    }
  }
}'
This produces the following result:
{
  "tokens": [
    {
      "token": "EDR-00002",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "00002",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    }
  ]
}
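The whole pipeline can be sketched in a few lines of Python: tokenize with the custom split pattern, then emit the capture groups for each token. This is a rough emulation, not how Elasticsearch implements it internally; note that group 1 of the regex equals the full match here, so it doubles as the preserved original.

```python
import re

TOKENIZER = re.compile(r"[^\w-]+")                    # custom pattern tokenizer
CAPTURE = re.compile(r"([A-Za-z]+-([0]+([0-9]+)))")   # pattern_capture regex

def analyze(text):
    """Rough sketch of the analyzer: split the text, then for each token
    that matches, emit every capture group (group 1 is the full EDR-00002
    form, group 2 the zero-padded number, group 3 the bare number)."""
    out = []
    for token in filter(None, TOKENIZER.split(text)):
        m = CAPTURE.match(token)
        if m:
            out.extend([m.group(1), m.group(2), m.group(3)])
        else:
            out.append(token)  # non-matching tokens pass through unchanged
    return out

print(analyze("EDR-00002"))  # ['EDR-00002', '00002', '2']
```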