I'm trying to implement an Elasticsearch pattern_capture filter that turns EDR-00004 into the tokens [EDR-00004, 00004, 4]. I'm (still) on Elasticsearch 2.4, but the documentation doesn't differ from current ES versions.
I followed the example in the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/2.4/analysis-pattern-capture-tokenfilter.html
Here are my tests and their results:
curl -XPUT 'localhost:9200/test_index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "process_number_filter": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": [
            "([A-Za-z]+-([0]+([0-9]+)))"
          ]
        }
      },
      "analyzer": {
        "process_number_analyzer": {
          "type": "custom",
          "tokenizer": "pattern",
          "filter": ["process_number_filter"]
        }
      }
    }
  }
}'
curl -XGET 'localhost:9200/test_index/_analyze' -d '
{
  "analyzer": "process_number_analyzer",
  "text": "EDR-00002"
}'
curl -XGET 'localhost:9200/test_index/_analyze' -d '
{
  "analyzer": "standard",
  "tokenizer": "standard",
  "filter": ["process_number_filter"],
  "text": "EDR-00002"
}'
These return:
{"acknowledged":true}
{
  "tokens": [{
    "token": "EDR",
    "start_offset": 0,
    "end_offset": 3,
    "type": "word",
    "position": 0
  }, {
    "token": "00002",
    "start_offset": 4,
    "end_offset": 9,
    "type": "word",
    "position": 1
  }]
}
{
  "tokens": [{
    "token": "edr",
    "start_offset": 0,
    "end_offset": 3,
    "type": "<ALPHANUM>",
    "position": 0
  }, {
    "token": "00002",
    "start_offset": 4,
    "end_offset": 9,
    "type": "<NUM>",
    "position": 1
  }]
}
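Both _analyze calls emit only "EDR" and "00002" because the tokenizer splits the input on the hyphen before the filter ever runs; the pattern tokenizer's default split pattern is \W+. Python's re module (which the question already uses below to check the regex) illustrates the effect:

```python
import re

# The pattern tokenizer splits on \W+ by default; the hyphen is a
# non-word character, so "EDR-00002" is broken apart before the
# pattern_capture filter sees it.
tokens = [t for t in re.split(r"\W+", "EDR-00002") if t]
print(tokens)  # ['EDR', '00002']
```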
I also made sure my regular expression is correct:
>>> import re
>>> m = re.match(r"([A-Za-z]+-([0]+([0-9]+)))", "EDR-00004")
>>> m.groups()
('EDR-00004', '00004', '4')
Answer (score: 0)
I hate answering my own question, but I found the answer, and maybe it will help someone in the future.
My problem was the default tokenizer, which split the text before it ever reached my filter. By adding my own tokenizer, overriding the default split pattern "\W+" with "[^\w-]+", my filter receives the whole word and produces the correct tokens.
Here are my custom settings:
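The change in split pattern can be checked with Python's re module: excluding the hyphen from the split class keeps the hyphenated identifier in one piece.

```python
import re

# [^\w-]+ splits on runs of characters that are neither word characters
# nor hyphens, so "EDR-00002" survives tokenization intact.
tokens = [t for t in re.split(r"[^\w-]+", "text: EDR-00002!") if t]
print(tokens)  # ['text', 'EDR-00002']
```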
curl -XPUT 'localhost:9200/test_index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "process_number_filter": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": [
            "([A-Za-z]+-([0]+([0-9]+)))"
          ]
        }
      },
      "tokenizer": {
        "process_number_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\w-]+"
        }
      },
      "analyzer": {
        "process_number_analyzer": {
          "type": "custom",
          "tokenizer": "process_number_tokenizer",
          "filter": ["process_number_filter"]
        }
      }
    }
  }
}'
This produces the following result:
{
  "tokens": [
    {
      "token": "EDR-00002",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "00002",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    }
  ]
}
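The whole pipeline can be sketched in a few lines of Python: tokenize with the custom split pattern, then emit the capture groups for each token. This is a rough emulation, not how Elasticsearch implements it internally; note that group 1 of the regex equals the full match here, so it doubles as the preserved original.

```python
import re

TOKENIZER = re.compile(r"[^\w-]+")                    # custom pattern tokenizer
CAPTURE = re.compile(r"([A-Za-z]+-([0]+([0-9]+)))")   # pattern_capture regex

def analyze(text):
    """Rough sketch of the analyzer: split the text, then for each token
    that matches, emit every capture group (group 1 is the full EDR-00002
    form, group 2 the zero-padded number, group 3 the bare number)."""
    out = []
    for token in filter(None, TOKENIZER.split(text)):
        m = CAPTURE.match(token)
        if m:
            out.extend([m.group(1), m.group(2), m.group(3)])
        else:
            out.append(token)  # non-matching tokens pass through unchanged
    return out

print(analyze("EDR-00002"))  # ['EDR-00002', '00002', '2']
```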