Elasticsearch - Analyzer creates the correct tokens, but the query does not match

Time: 2018-03-21 11:28:11

Tags: elasticsearch

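I am trying to get Elasticsearch to ignore hyphens. I don't want it to split the text on either side of a hyphen into separate words. It seems simple, but I am banging my head against the wall.

I want the string "Roland JD-Xi" to produce the following terms: [roland jd-xi, roland, jd-xi, jdxi, roland jdxi]

I have not been able to achieve this easily. Most people will just type 'jdxi', so my initial idea was simply to remove the hyphen. I am therefore using the following definition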

"name": {
    "type": "string",
    "analyzer": "language",
    "include_in_all": true,
    "boost": 5,
    "fields": {
        "my_standard": {
            "type": "string",
            "analyzer": "my_standard"
        },
        "my_prefix": {
            "type": "string",
            "analyzer": "my_text_prefix",
            "search_analyzer": "my_standard"
        },
        "my_suffix": {
            "type": "string",
            "analyzer": "my_text_suffix",
            "search_analyzer": "my_standard"
        }
    }
}

The relevant analyzers and filters are defined as

{
"number_of_replicas": 0,
"number_of_shards": 1,
"analysis": {
    "analyzer": {
        "std": {
            "tokenizer": "standard",
            "char_filter": "html_strip",
            "filter": ["standard", "elision", "asciifolding", "lowercase", "stop", "length", "strip_hyphens"]
        ...
        "my_text_prefix": {
            "tokenizer": "whitespace",
            "char_filter": "my_filter",
            "filter": ["standard", "elision", "asciifolding", "lowercase", "stop", "edge_ngram_front"]
        },
        "my_text_suffix": {
            "tokenizer": "whitespace",
            "char_filter": "my_filter",
            "filter": ["standard", "elision", "asciifolding", "lowercase", "stop", "edge_ngram_back"]
        },
        "my_standard": {
            "type": "custom",
            "tokenizer": "whitespace",
            "char_filter": "my_filter",
            "filter": ["standard", "elision", "asciifolding", "lowercase"]
        }
    },
    "char_filter": {
        "my_filter": {
            "type": "mapping",
            "mappings": ["- => ", ". => "]
        }
    },
    "filter": {
        "edge_ngram_front": {
            "type": "edgeNGram",
            "min_gram": 1,
            "max_gram": 20,
            "side": "front"
        },
        "edge_ngram_back": {
            "type": "edgeNGram",
            "min_gram": 1,
            "max_gram": 20,
            "side": "back"
        },
        "strip_spaces": {
            "type": "pattern_replace",
            "pattern": "\\s",
            "replacement": ""
        },
        "strip_dots": {
            "type": "pattern_replace",
            "pattern": "\\.",
            "replacement": ""
        },
        "strip_hyphens": {
            "type": "pattern_replace",
            "pattern": "-",
            "replacement": ""
        },
        "stop": {
            "type": "stop",
            "stopwords": "_none_"
        },
        "length": {
            "type": "length",
            "min": 1
        }
    }
}

I have been able to test this (i.e. with _analyze), and the string "Roland JD-Xi" is tokenized as [roland, jdxi]

That is not exactly what I want, but it is close enough that it should match 'jdxi'.
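To make the token stream above easier to follow, here is a rough Python stand-in for the my_standard analyzer chain (mapping char filter, whitespace tokenizer, asciifolding, lowercase). It is only a sketch, not Elasticsearch itself: the function name `my_standard_sketch` is made up for illustration, asciifolding is approximated by stripping combining accents, and the "standard" and "elision" filters are left out because they should not affect this input.

```python
import unicodedata

# Mirrors the "my_filter" mapping char filter: "-" and "." are deleted.
CHAR_MAPPINGS = {"-": "", ".": ""}


def my_standard_sketch(text):
    """Hypothetical approximation of the my_standard analyzer chain."""
    # 1. char_filter: apply the mappings before tokenizing.
    for src, dst in CHAR_MAPPINGS.items():
        text = text.replace(src, dst)
    # 2. tokenizer "whitespace": split on runs of whitespace.
    tokens = text.split()
    # 3. filter "asciifolding" (approximated): drop combining accents.
    folded = []
    for tok in tokens:
        decomposed = unicodedata.normalize("NFKD", tok)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    # 4. filter "lowercase".
    return [tok.lower() for tok in folded]


print(my_standard_sketch(" Roland JD-Xi"))  # ['roland', 'jdxi']
```

Because the hyphen is removed before tokenizing, "JD-Xi" never gets a chance to split, which is why only two tokens come out.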

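But here is my problem. If I do a simple "index/_search?q=jdxi", it does not bring back the document. However, if I do "index/_search?q=roland+jdxi", it does bring back the document.

So at least I know the hyphen is being removed, but if the tokens "roland" and "jdxi" are being created, how come "index/_search?q=jdxi" does not match the document?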

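  1. Is there something wrong with my indexing process or my query process?
  2. How do I fix it?
  3. Can anyone explain how to achieve the desired tokens [roland jd-xi, roland, jd-xi, jdxi, roland jdxi]?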

1 Answer:

Answer 0: (score: 3)

I reproduced your case on ES 6, and there searching index/_search?q=jdxi does return the document.

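The problem is probably that when you search index/_search?q=jdxi without specifying a field, the query basically runs against _all, which contains the content of the name field (so it is basically the same as index/_search?q=name:jdxi). Since that field is not analyzed with the my_standard analyzer, you get no results.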
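The Python sketch below illustrates one plausible reading of that explanation. The `language` analyzer is not shown in the question, so this assumes (and it is only an assumption) that the field being searched is tokenized much like the standard tokenizer, which breaks "JD-Xi" at the hyphen. The helper names are hypothetical, and the OR matching reflects the default operator of the `q=` query string.

```python
import re


def standard_like_tokens(text):
    # Assumption: split on anything that is not a letter or digit, then lowercase.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def matches_any(query, indexed_terms):
    # q=... combines terms with OR by default, so one shared term is enough.
    return any(term in indexed_terms for term in standard_like_tokens(query))


indexed = set(standard_like_tokens("Roland JD-Xi"))
print(sorted(indexed))                      # ['jd', 'roland', 'xi']
print(matches_any("jdxi", indexed))         # False: no "jdxi" term was indexed
print(matches_any("roland jdxi", indexed))  # True: "roland" alone matches
```

Under that assumption, q=roland+jdxi succeeds only because of "roland", which fits the behaviour described in the question.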

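What you should do instead is search on the my_standard sub-field, i.e. index/_search?q=name.my_standard:jdxi, and I am pretty sure you will get the document.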
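As a small illustration of targeting the sub-field, the Python sketch below builds both the URI form of the search and a request body using a `match` query on the same field. It only constructs the request and does not contact a cluster; the function name `field_search` is hypothetical, and "index" is the same placeholder index name used in the question.

```python
from urllib.parse import urlencode


def field_search(index, field, text):
    """Build a field-qualified search (hypothetical helper, no network calls)."""
    # URI search: q=<field>:<text>, URL-encoded (":" becomes %3A).
    url = f"/{index}/_search?" + urlencode({"q": f"{field}:{text}"})
    # Request-body alternative: a match query against the same sub-field.
    body = {"query": {"match": {field: text}}}
    return url, body


url, body = field_search("index", "name.my_standard", "jdxi")
print(url)   # /index/_search?q=name.my_standard%3Ajdxi
print(body)  # {'query': {'match': {'name.my_standard': 'jdxi'}}}
```

Either form sends the query text through the analyzer configured for name.my_standard, so "jdxi" is compared against the [roland, jdxi] tokens that were indexed there.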