How to match query terms containing hyphens or trailing spaces in elasticsearch

Date: 2015-01-28 22:56:33

Tags: elasticsearch lucene elastic-beanstalk

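The mapping char_filter section of the elasticsearch documentation is somewhat vague, and I am having trouble understanding whether and how to use a char filter in an analyzer: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html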

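Basically, the data we store in the index are IDs of type String that look like this: "008392342000". I want to be able to find such an ID when the query term actually contains a hyphen or a trailing space, like this: "008392342-000 "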

How would you advise I set up the analyzer? Currently this is the definition of the field:

"mappings": {
    "client": {
        "properties": {
            "ucn": {
                "type": "multi_field",
                "fields": {
                    "ucn_autoc": {
                        "type": "string",
                        "index": "analyzed",
                        "index_analyzer": "autocomplete_index",
                        "search_analyzer": "autocomplete_search"
                    },
                    "ucn": {
                        "type": "string",
                        "index": "not_analyzed"
                    }
                }
            }
        }
    }
}

Here are the index settings, including the analyzers and filters:

 "settings": {
        "analysis": {
            "filter": {
                "autocomplete_ngram": {
                    "max_gram": 15,
                    "min_gram": 1,
                    "type": "edge_ngram"
                },
                "ngram_filter": {
                    "type": "nGram",
                    "min_gram": 2,
                    "max_gram": 8
                }
            },
            "analyzer": {
                "lowercase_analyzer": {
                    "filter": [
                        "lowercase"
                    ],
                    "tokenizer": "keyword"
                },
                "autocomplete_index": {
                    "filter": [
                        "lowercase",
                        "autocomplete_ngram"
                    ],
                    "tokenizer": "keyword"
                },
                "ngram_index": {
                    "filter": [
                        "ngram_filter",
                        "lowercase"
                    ],
                    "tokenizer": "keyword"
                },
                "autocomplete_search": {
                    "filter": [
                        "lowercase"
                    ],
                    "tokenizer": "keyword"
                },
                "ngram_search": {
                    "filter": [
                        "lowercase"
                    ],
                    "tokenizer": "keyword"
                }
            },
            "index": {
                "number_of_shards": 6,
                "number_of_replicas": 1
            }
        }
    }

1 answer:

Answer 0 (score: 4)

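You haven't provided your actual analyzers, the data that goes in, or what you expect to get back, but based on the information you did provide I would start with this: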

{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": [
            "-=>"
          ]
        }
      },
      "analyzer": {
        "autocomplete_search": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_mapping"
          ],
          "filter": [
            "trim"
          ]
        },
        "autocomplete_index": {
          "tokenizer": "keyword",
          "filter": [
            "trim"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "ucn": {
          "type": "multi_field",
          "fields": {
            "ucn_autoc": {
              "type": "string",
              "index": "analyzed",
              "index_analyzer": "autocomplete_index",
              "search_analyzer": "autocomplete_search"
            },
            "ucn": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

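The char_filter replaces - with nothing (that is what the mapping -=> means). I would also use the trim filter to strip any leading or trailing whitespace. I don't know what your autocomplete_index analyzer is, so I simply used the keyword tokenizer.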
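To make the effect of that chain concrete, here is a rough Python approximation. It is not Elasticsearch code: the function names analyze_search and analyze_index are made up for illustration, and the sketch assumes the keyword tokenizer emits the whole input as a single token, so only the hyphen mapping and the trim step change the text.

```python
def analyze_search(text: str) -> str:
    """Approximate the autocomplete_search analyzer from the settings above."""
    # char_filter "my_mapping": the rule "-=>" maps each hyphen to nothing
    without_hyphens = text.replace("-", "")
    # keyword tokenizer keeps the input as one token; trim strips the ends
    return without_hyphens.strip()


def analyze_index(text: str) -> str:
    """Approximate the autocomplete_index analyzer (keyword tokenizer + trim)."""
    return text.strip()


if __name__ == "__main__":
    stored = analyze_index("008392342000")
    searched = analyze_search("008392342-000 ")
    # Both sides reduce to the same single token, so the match succeeds
    print(stored == searched)
```

Note that in this setup only the search side removes hyphens. If IDs could also be indexed with hyphens, the same char_filter would presumably be needed on the index analyzer as well.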

Testing the analyzer with GET /my_index/_analyze?analyzer=autocomplete_search&text= 0123-34742-000 results in:

"tokens": [
      {
         "token": "012334742000",
         "start_offset": 0,
         "end_offset": 17,
         "type": "word",
         "position": 1
      }
   ]

This shows that it does remove the - and the whitespace. A typical query would be:

{
  "query": {
    "match": {
      "ucn.ucn_autoc": " 0123-34742-000  "
    }
  }
}
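If the query is assembled in application code, the user's input can be passed through untouched, since the search_analyzer on ucn.ucn_autoc does the normalization. A minimal sketch in Python, assuming a hypothetical helper named build_match_query; it only builds the JSON body and does not send it anywhere:

```python
import json


def build_match_query(raw_input: str, field: str = "ucn.ucn_autoc") -> dict:
    """Build a match query body; raw_input is deliberately not cleaned here."""
    # Hyphens and surrounding spaces are left in place on purpose:
    # the field's search analyzer is expected to strip them.
    return {"query": {"match": {field: raw_input}}}


if __name__ == "__main__":
    body = build_match_query(" 0123-34742-000  ")
    print(json.dumps(body, indent=2))
```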