Question

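I am trying to set up a new mapping for an index. It will support partial keyword search and autocomplete requests powered by ES.

An edgeNGram token filter combined with the whitespace tokenizer seems to be the way to go. So far my settings look like this: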

curl -XPUT 'localhost:9200/test_ngram_2?pretty' -H 'Content-Type: application/json' -d'{
"settings": {
    "index": {
        "analysis": {
            "analyzer": {
                "customNgram": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "customNgram"]
                }
            },
            "filter": {
                "customNgram": {
                    "type": "edgeNGram",
                    "min_gram": "3",
                    "max_gram": "18",
                    "side": "front"
                }
            }
        }
    }
}
}'

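The problem is with Japanese words! Do NGrams work on Japanese characters? For example: 【11月13日13時まで，フォロー＆！RTで応募】

There is no whitespace in this text, and the document cannot be found with a partial keyword. Is that expected?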
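
To make the behaviour concrete, here is a minimal Python sketch that imitates the analyzer chain above (whitespace tokenizer, lowercase, front edge n-grams with min_gram 3 and max_gram 18). It is only a stand-in for illustration, not Elasticsearch itself; the function names `edge_ngrams` and `analyze` are made up for this example, and the sample sentence is written with a plain ＆ instead of the HTML-escaped ampersand.

```python
def edge_ngrams(token, min_gram=3, max_gram=18):
    """Return the front-anchored n-grams (prefixes) of a single token."""
    longest = min(len(token), max_gram)
    # Tokens shorter than min_gram produce an empty list.
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(text, min_gram=3, max_gram=18):
    """Rough imitation of: whitespace tokenizer -> lowercase -> edge n-gram."""
    grams = []
    for token in text.split():  # split on whitespace only
        grams.extend(edge_ngrams(token.lower(), min_gram, max_gram))
    return grams


english = analyze("Follow and Retweet")
japanese = analyze("【11月13日13時まで，フォロー＆！RTで応募】")

print(english)
# Every Japanese gram is a prefix of the one unbroken token.
print(len(japanese), japanese[0], japanese[-1])
```

In this sketch the English text is split into three tokens, each with its own prefixes, while the Japanese sentence stays a single token, so every gram starts at 【. A partial keyword such as フォロー never appears as a gram on its own, which would explain why the document is not found.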

Answer 1

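You might want to take a look at the icu_tokenizer, which adds support for foreign languages: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-tokenizer.html

Tokenizes text into words on word boundaries, as defined in UAX #29: Unicode Text Segmentation. It behaves much like the standard tokenizer, but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables.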

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}

Note that to use it in your index, you need to install the corresponding plugin:

bin/elasticsearch-plugin install analysis-icu
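
Once the plugin is installed, one way to check how the tokenizer splits your text is the `_analyze` API. The Python sketch below only builds the request body and prints an equivalent curl command; it does not contact a cluster. The helper name `icu_analyze_request` and the host `localhost:9200` are assumptions for illustration, and the sample sentence is written with a plain ＆.

```python
import json


def icu_analyze_request(text):
    """Build the body for POST /_analyze using the plugin's icu_tokenizer."""
    return {"tokenizer": "icu_tokenizer", "text": text}


body = icu_analyze_request("【11月13日13時まで，フォロー＆！RTで応募】")

# Print a curl command that can be pasted into a shell (host is an assumption).
print("curl -XPOST 'localhost:9200/_analyze?pretty' "
      "-H 'Content-Type: application/json' -d'")
print(json.dumps(body, ensure_ascii=False, indent=2))
print("'")
```

The response lists the tokens the tokenizer actually produces, which is the easiest way to see whether a given keyword ends up as its own token.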

Applied to your settings, this gives:

curl -XPUT 'localhost:9200/test_ngram_2?pretty' -H 'Content-Type: application/json' -d'{
"settings": {
    "index": {
        "analysis": {
            "analyzer": {
                "customNgram": {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    "filter": ["lowercase", "customNgram"]
                }
            },
            "filter": {
                "customNgram": {
                    "type": "edgeNGram",
                    "min_gram": "3",
                    "max_gram": "18",
                    "side": "front"
                }
            }
        }
    }
}
}'
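
One thing to watch with this combination is min_gram. Japanese words are often only one or two characters long, and as far as I know the edge n-gram filter emits nothing for a token shorter than min_gram by default. The Python sketch below illustrates this with a hand-written segmentation; the token list is an assumption for illustration and is not real icu_tokenizer output, and `edge_ngrams` is a made-up helper.

```python
def edge_ngrams(token, min_gram=3, max_gram=18):
    """Return the front-anchored n-grams (prefixes) of a single token."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# ASSUMPTION: hypothetical word segmentation, written by hand for this example.
hypothetical_tokens = ["11", "月", "13", "日", "13", "時", "まで",
                       "フォロー", "rt", "で", "応募"]

indexed = {tok: edge_ngrams(tok) for tok in hypothetical_tokens}
kept = {tok: grams for tok, grams in indexed.items() if grams}
dropped = {tok for tok, grams in indexed.items() if not grams}

print("tokens with grams:", kept)
print("tokens with no grams:", dropped)
```

If short words such as 応募 need to be searchable, a lower min_gram (1 or 2) is worth testing against your own data.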

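Normally you would search an autocomplete field like this with the standard analyzer. Here, instead, add an analyzer to your mapping that uses icu_tokenizer (but not the edgeNGram filter) and apply it to your query at search time, or explicitly set it as the search_analyzer of the field that customNgram is applied to.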
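
A minimal sketch of that setup, written as a Python function that builds the index body. The field name `title` and the analyzer name `icu_search` are hypothetical. The typeless `mappings.properties` layout assumes Elasticsearch 7 or later, and the filter is spelled `edge_ngram` without `side`, the form used in recent documentation. Older versions accept the `edgeNGram` spelling shown earlier.

```python
import json


def build_index_body(field="title"):
    """Index body: n-grams at index time, plain ICU tokens at search time."""
    return {
        "settings": {"index": {"analysis": {
            "analyzer": {
                # Index-time analyzer: ICU tokens expanded into prefixes.
                "customNgram": {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    "filter": ["lowercase", "customNgram"],
                },
                # Search-time analyzer (hypothetical name): no n-gram filter.
                "icu_search": {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    "filter": ["lowercase"],
                },
            },
            "filter": {
                "customNgram": {"type": "edge_ngram",
                                "min_gram": 3, "max_gram": 18},
            },
        }}},
        "mappings": {"properties": {
            field: {
                "type": "text",
                "analyzer": "customNgram",
                "search_analyzer": "icu_search",
            },
        }},
    }


body = build_index_body()
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be sent as the body of the index creation request. Leaving the n-gram filter out at search time means the query is matched as typed against the indexed prefixes, rather than being expanded into prefixes itself.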

Elasticsearch: Does the edgeNGram token filter work on non-English tokens?
