I'm trying to replicate the mapping below using NEST, and I'm having trouble mapping the token characters (token_chars) to the tokenizer.
{
  "settings": {
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "nGram_filter"
          ]
        }
      }
    }
  }
}
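I was able to replicate everything except the token characters part. Can someone help with that? Below is my code replicating the mapping above (except for the token characters part).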
var nGramFilters1 = new List<string> { "lowercase", "asciifolding", "nGram_filter" };
var tChars = new List<string> { "letter", "digit", "punctuation", "symbol" };
var createIndexResponse = client.CreateIndex(defaultIndex, c => c
    .Settings(st => st
        .Analysis(an => an
            .Analyzers(anz => anz
                .Custom("nGram_analyzer", cc => cc
                    .Tokenizer("whitespace")
                    .Filters(nGramFilters1)))
            .TokenFilters(tf => tf
                .NGram("nGram_filter", ng => ng
                    .MinGram(2)
                    .MaxGram(20))))));
Answer 0 (score: 5)
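The NGram Tokenizer supports token characters (token_chars). It uses them to decide which characters are kept in a token, and it splits on any character that is not in the list.
The NGram Token Filter operates on the tokens produced by the tokenizer, so its only options are the minimum and maximum gram sizes.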
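To make the effect of token_chars concrete, here is a minimal, self-contained C# sketch that imitates what the tokenizer does: split on every character outside the allowed classes, then emit n-grams inside each run. It does not use NEST or Elasticsearch. The names NGramSketch, IsTokenChar and Tokenize are made up for illustration, and mapping the four classes to the .NET char.Is* checks is an approximation, not Lucene's exact character classification.

```csharp
using System;
using System.Collections.Generic;

static class NGramSketch
{
    // Rough stand-in for token_chars = letter, digit, punctuation, symbol.
    // Assumption: the .NET character classes are close enough for a demo.
    static bool IsTokenChar(char ch) =>
        char.IsLetter(ch) || char.IsDigit(ch) ||
        char.IsPunctuation(ch) || char.IsSymbol(ch);

    public static List<string> Tokenize(string text, int minGram, int maxGram)
    {
        var grams = new List<string>();
        int start = 0;
        while (start < text.Length)
        {
            // Characters outside token_chars (here: whitespace) act as split points.
            if (!IsTokenChar(text[start])) { start++; continue; }

            // Find the end of the current run of allowed characters.
            int end = start;
            while (end < text.Length && IsTokenChar(text[end])) end++;

            // Emit every n-gram of length minGram..maxGram inside the run.
            for (int i = start; i < end; i++)
                for (int n = minGram; n <= maxGram && i + n <= end; n++)
                    grams.Add(text.Substring(i, n));

            start = end;
        }
        return grams;
    }

    static void Main()
    {
        // Prints: a1, b-, b-c, -c
        Console.WriteLine(string.Join(", ", Tokenize("a1 b-c", 2, 3)));
    }
}
```

Note that no gram crosses the space, because whitespace is not among the listed classes. A token filter has no such option, since it only ever sees tokens that the tokenizer has already cut.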
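Given your current analysis chain, you probably want something like the following. Because whitespace is not one of the listed token_chars, the ngram tokenizer still splits on it, so it can take the place of the whitespace tokenizer.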
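// Assumption: nGramFilters is not defined in the original snippet. Since the
// tokenizer now produces the n-grams, only the remaining filters are listed.
var nGramFilters = new List<string> { "lowercase", "asciifolding" };

var createIndexResponse = client.CreateIndex(defaultIndex, c => c
    .Settings(st => st
        .Analysis(an => an
            .Analyzers(anz => anz
                .Custom("ngram_analyzer", cc => cc
                    .Tokenizer("ngram_tokenizer")
                    .Filters(nGramFilters))
            )
            .Tokenizers(tz => tz
                .NGram("ngram_tokenizer", td => td
                    .MinGram(2)
                    .MaxGram(20)
                    .TokenChars(
                        TokenChar.Letter,
                        TokenChar.Digit,
                        TokenChar.Punctuation,
                        TokenChar.Symbol
                    )
                )
            )
        )
    )
);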
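For comparison with the JSON in the question, the NEST call above should correspond roughly to the index settings below. This is a sketch, not captured output: it assumes nGramFilters contains only "lowercase" and "asciifolding", and the exact spelling of the type name ("nGram" versus "ngram") may differ between NEST and Elasticsearch versions.

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": ["letter", "digit", "punctuation", "symbol"]
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}
```

The difference from the question's mapping is that token_chars now sits under "tokenizer" rather than "filter", which is the only place it is supported.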