Token Chars Mapping to NGram Filter, Elasticsearch NEST

Time: 2016-06-28 02:07:21

Tags: elasticsearch nest elasticsearch-net

I am trying to replicate the mapping below using NEST, and am having trouble mapping the token chars to the tokenizer.

{
   "settings": {
      "analysis": {
         "filter": {
            "nGram_filter": {
               "type": "nGram",
               "min_gram": 2,
               "max_gram": 20,
               "token_chars": [
                  "letter",
                  "digit",
                  "punctuation",
                  "symbol"
               ]
            }
         },
         "analyzer": {
            "nGram_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "nGram_filter"
               ]
            }
         }
      }
   }
}

I am able to replicate everything except the token_chars part. Can someone help with that? Below is my code that replicates the above mapping (minus the token chars part).

var nGramFilters1 = new List<string> { "lowercase", "asciifolding", "nGram_filter" };
// the token chars I want to apply, but cannot see where to map them
var tChars = new List<string> { "letter", "digit", "punctuation", "symbol" };

var createIndexResponse = client.CreateIndex(defaultIndex, c => c
    .Settings(st => st
        .Analysis(an => an
            .Analyzers(anz => anz
                .Custom("nGram_analyzer", cc => cc
                    .Tokenizer("whitespace")
                    .Filters(nGramFilters1)))
            .TokenFilters(tf => tf
                .NGram("nGram_filter", ng => ng
                    .MinGram(2)
                    .MaxGram(20))))));

References

  1. SO Question
  2. GitHub Issue

1 Answer:

Answer 0 (score: 5)

The NGram Tokenizer supports token characters (token_chars), using these to determine which characters should be kept in tokens, splitting on anything that does not appear in the list.

The NGram Token Filter, on the other hand, operates on the tokens produced by a tokenizer, so it only has options for the minimum and maximum grams that should be produced.

Based on your current analysis chain, you likely need something like the following:

// n-gramming now happens in the tokenizer, so the analyzer's filter list
// no longer includes (and the settings no longer define) the nGram_filter token filter
var nGramFilters = new List<string> { "lowercase", "asciifolding" };

var createIndexResponse = client.CreateIndex(defaultIndex, c => c
    .Settings(st => st
        .Analysis(an => an
            .Analyzers(anz => anz
                .Custom("ngram_analyzer", cc => cc
                    .Tokenizer("ngram_tokenizer")
                    .Filters(nGramFilters))
            )
            .Tokenizers(tz => tz
                .NGram("ngram_tokenizer", td => td
                    .MinGram(2)
                    .MaxGram(20)
                    .TokenChars(
                        TokenChar.Letter,
                        TokenChar.Digit,
                        TokenChar.Punctuation,
                        TokenChar.Symbol
                    )
                )
            )
        )
    )
);
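
To check that the analyzer behaves as expected, you can inspect the tokens it emits with the Analyze API. A minimal sketch, assuming the index was just created as above and that client and defaultIndex are the same as in the question (the sample text is arbitrary):

// run ngram_analyzer against a sample string on the new index
var analyzeResponse = client.Analyze(a => a
    .Index(defaultIndex)
    .Analyzer("ngram_analyzer")
    .Text("F-1 cars"));

// each token is one of the 2-20 character n-grams produced by ngram_tokenizer,
// lowercased and ASCII-folded by the token filters
foreach (var token in analyzeResponse.Tokens)
{
    Console.WriteLine(token.Token);
}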