Elasticsearch: how do I concatenate words and then ngram them?

Asked: 2014-12-01 07:51:46

Tags: elasticsearch n-gram

I want to concatenate the words and then ngram the result. What are the correct Elasticsearch settings for this?

For example, in English:

stack overflow

==> stackoverflow: concatenate first,

==> sta / tac / ack / cko / kov / ... and so on (ngram with min_gram: 3, max_gram: 10)

1 answer:

Answer 0 (score: 2):

For the concatenation, I assume you simply want to remove all whitespace from the input data. To do that, you need a pattern_replace char filter that replaces the spaces.

Setting up the ngram tokenizer should be straightforward: just specify the minimum and maximum token lengths.

It is also worth adding a lowercase token filter so that searches are case-insensitive.

curl -XPOST localhost:9200/my_index -d '{
  "index": {
    "analysis": {
      "analyzer": {
        "my_new_analyzer": {
          "filter": ["lowercase"],
          "tokenizer": "my_ngram_tokenizer",
          "char_filter": ["my_pattern"],
          "type": "custom"
        }
      },
      "char_filter": {
        "my_pattern": {
          "type": "pattern_replace",
          "pattern": "\u0020",
          "replacement": ""
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "10",
          "token_chars": ["letter", "digit", "punctuation", "symbol"]
        }
      }
    }
  }
}'
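
A note on the pattern: "\u0020" is just the Unicode escape for a single space character, so only plain spaces are stripped; if the input can also contain tabs or other whitespace, a pattern such as "\\s+" should work as well.

Once the index exists, you would normally also map a field to the new analyzer so that documents get the same concatenate-then-ngram treatment at index time. Below is a minimal sketch; the type name doc and field name title are placeholders, not part of the original answer:

# hypothetical mapping - type "doc" and field "title" are placeholders,
# the analyzer name comes from the settings created above
curl -XPUT 'localhost:9200/my_index/_mapping/doc' -d '{
  "doc": {
    "properties": {
      "title": {
        "type": "string",
        "analyzer": "my_new_analyzer"
      }
    }
  }
}'

Depending on the use case, you may prefer to set a plainer search_analyzer on that field (for example a keyword tokenizer plus the same char filter and a lowercase filter), so that only the indexed text is split into ngrams and query strings are matched as a whole.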

Testing this:

curl -XGET 'localhost:9200/my_index/_analyze?analyzer=my_new_analyzer&pretty' -d 'stack overflow'

gives the following (only a small part of the output is shown below):

{
"tokens" : [ {
  "token" : "sta",
  "start_offset" : 0,
  "end_offset" : 3,
  "type" : "word",
  "position" : 1
}, {
  "token" : "stac",
  "start_offset" : 0,
  "end_offset" : 4,
  "type" : "word",
  "position" : 2
}, {
  "token" : "stack",
  "start_offset" : 0,
  "end_offset" : 6,
  "type" : "word",
  "position" : 3
}, {
  "token" : "stacko",
  "start_offset" : 0,
  "end_offset" : 7,
  "type" : "word",
  "position" : 4
}, {
  "token" : "stackov",
  "start_offset" : 0,
  "end_offset" : 8,
  "type" : "word",
  "position" : 5
}, {