Elasticsearch edge-ngram and pattern filter

Time: 2018-04-03 13:18:39

Tags: elasticsearch elasticsearch-2.0 elasticsearch-analyzers

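I have titles like SimpleDoc000155 / 1 (the number of characters is not fixed, but they are always followed by 9 digits and/or a number), and I would like to know how to analyze them so that I get the tokens 155 and SimpleDoc000155.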

The Elasticsearch version is 2.2.

My current settings are:

    "analysis": {
        "analyzer": {
            "my_analyzer": {
                "tokenizer": "autocomplete",
                "filter": ["code", "lowercase"]
            }
        },
        "filter": {
            "code": {
                "type": "pattern_capture",
                "preserve_original": 1,
                "patterns": ["([1-9].+(?=\/))"]
            }
        },
        "tokenizer": {
            "autocomplete": {
                "type": "edge_ngram",
                "min_gram": 6,
                "max_gram": 32,
                "token_chars": [
                    "letter",
                    "digit"
                ]
            }
        }
    }
}

The result I get is:

{
    "tokens": [{
            "token": "simple",
            "start_offset": 0,
            "end_offset": 6,
            "type": "word",
            "position": 0
        },
        {
            "token": "simpled",
            "start_offset": 0,
            "end_offset": 7,
            "type": "word",
            "position": 1
        },
        {
            "token": "simpledo",
            "start_offset": 0,
            "end_offset": 8,
            "type": "word",
            "position": 2
        },
        {
            "token": "simpledoc",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 3
        },
        {
            "token": "simpledoc0",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 4
        },
        {
            "token": "simpledoc00",
            "start_offset": 0,
            "end_offset": 11,
            "type": "word",
            "position": 5
        },
        {
            "token": "simpledoc000",
            "start_offset": 0,
            "end_offset": 12,
            "type": "word",
            "position": 6
        },
        {
            "token": "simpledoc0001",
            "start_offset": 0,
            "end_offset": 13,
            "type": "word",
            "position": 7
        },
        {
            "token": "simpledoc00015",
            "start_offset": 0,
            "end_offset": 14,
            "type": "word",
            "position": 8
        },
        {
            "token": "simpledoc000155",
            "start_offset": 0,
            "end_offset": 15,
            "type": "word",
            "position": 9
        }
    ]
}
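
For reference, the token list above can be reproduced with a rough simulation outside Elasticsearch. The sketch below is plain Python, not the Elasticsearch implementation; the function name `edge_ngram_tokens` is hypothetical, and it assumes that `token_chars` of `letter` and `digit` makes every other character (including `/`) a separator and that the title is written as `SimpleDoc000155/1`.

```python
import re

def edge_ngram_tokens(text, min_gram=6, max_gram=32):
    """Approximate an edge_ngram tokenizer with token_chars = letter, digit."""
    tokens = []
    # Anything that is not a letter or digit acts as a separator,
    # so "/" splits the title into "SimpleDoc000155" and "1".
    for word in re.findall(r"[^\W_]+", text):
        # Only prefixes anchored at the start of each word are emitted.
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:size])
    return tokens

# The "lowercase" filter is approximated with str.lower().
tokens = [t.lower() for t in edge_ngram_tokens("SimpleDoc000155/1")]
print(tokens)
```

Under these assumptions every token is a prefix of the first word, so a suffix such as 155 is never produced, and the trailing 1 is dropped because it is shorter than `min_gram`.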

I'm a bit lost. I've tried a lot, but I can't get 155 out of it; it looks like pattern_capture isn't working properly.

Thanks for your answers!

Update

Changing the tokenizer from edge_ngram to ngram sort of works, but it produces a lot of unwanted tokens.
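
One possible reason the pattern_capture filter appears to do nothing: the pattern contains the lookahead `(?=\/)`, but a tokenizer restricted to letters and digits never passes a `/` on to the filter. The sketch below checks this with Python's `re` module rather than Elasticsearch itself. The helper `pattern_capture` is a hypothetical approximation of the filter, the second pattern is an illustration that is not part of the original settings, and the title is assumed to contain no spaces around the slash.

```python
import re

# The pattern from the "code" filter; JSON's "\/" is simply "/".
CODE_PATTERN = re.compile(r"([1-9].+(?=/))")
# Hypothetical second pattern (not in the original settings):
# everything before the slash.
NAME_PATTERN = re.compile(r"(.+)(?=/)")

def pattern_capture(token, patterns, preserve_original=True):
    """Approximate pattern_capture: original token plus every captured group."""
    out = [token] if preserve_original else []
    for pattern in patterns:
        for match in pattern.finditer(token):
            out.extend(g for g in match.groups() if g)
    return out

# What the filter sees after a letter/digit tokenizer has removed "/":
after_tokenizer = pattern_capture("SimpleDoc000155", [CODE_PATTERN])
# The same pattern applied to the whole, unsplit title:
whole_title = pattern_capture("SimpleDoc000155/1", [CODE_PATTERN])
# Both patterns on the whole title, without keeping the original:
both = pattern_capture("SimpleDoc000155/1", [NAME_PATTERN, CODE_PATTERN],
                       preserve_original=False)

print(after_tokenizer)  # ['SimpleDoc000155']
print(whole_title)      # ['SimpleDoc000155/1', '155']
print(both)             # ['SimpleDoc000155', '155']
```

If this approximation holds, a tokenizer that keeps the title as a single token (for example the `keyword` tokenizer) would let the lookahead see the slash. This has not been verified against Elasticsearch 2.2.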

0 answers:

No answers