I created an index with the following mapping:
curl -XPUT http://ubuntu:9200/ngram-test -d '{
  "settings": {
    "analysis": {
      "filter": {
        "mynGram": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [ "letter", "digit" ]
        }
      },
      "analyzer": {
        "domain_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "mynGram"]
        }
      }
    }
  },
  "mappings": {
    "assets": {
      "properties": {
        "domain": {
          "type": "string",
          "analyzer": "domain_analyzer"
        },
        "tag": {
          "include_in_parent": true,
          "type": "nested",
          "properties": {
            "name": {
              "type": "string",
              "analyzer": "domain_analyzer"
            }
          }
        }
      }
    }
  }
}'; echo
Then I added a document:
curl http://ubuntu:9200/ngram-test/assets/ -d '{
  "domain": "www.example.com",
  "tag": [
    {
      "name": "IIS"
    },
    {
      "name": "Microsoft ASP.NET"
    }
  ]
}'; echo
But when I validate the query,
http://ubuntu:9200/ngram-test/_validate/query?q=tag.name:asp.net&explain
the query has been rewritten into this:
filtered(tag.name:a tag.name:as tag.name:asp tag.name:asp. tag.name:asp.n tag.name:asp.ne tag.name:asp.net tag.name:s tag.name:sp tag.name:sp. tag.name:sp.n tag.name:sp.ne tag.name:sp.net tag.name:p tag.name:p. tag.name:p.n tag.name:p.ne tag.name:p.net tag.name:. tag.name:.n tag.name:.ne tag.name:.net tag.name:n tag.name:ne tag.name:net tag.name:e tag.name:et tag.name:t)->cache(org.elasticsearch.index.search.nested.NonNestedDocsFilter@ad04e78f)
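This is completely unexpected. I was expecting a query like asp.net*, *asp.net, or *asp.net*, not tag.name:a. As it stands, when I search for asp.net, things like alex will also show up in the search results, which is totally wrong.
Am I missing something?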
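To make the overlap concrete, here is a minimal Python sketch of what an n-gram filter with min_gram 1 and max_gram 10 does to a single token. The `ngrams` helper is a hypothetical stand-in written for illustration, not Elasticsearch code; it only mimics the term order seen in the validated query above.

```python
def ngrams(token, min_gram, max_gram):
    """Return every substring of token whose length is in [min_gram, max_gram]."""
    grams = []
    for start in range(len(token)):
        for length in range(min_gram, max_gram + 1):
            if start + length > len(token):
                break
            grams.append(token[start:start + length])
    return grams


# Terms produced for the query string (same analyzer at search time).
query_terms = ngrams("asp.net", 1, 10)
# Terms that would be indexed for an unrelated value such as "alex".
indexed_terms = ngrams("alex", 1, 10)

# Any shared term is enough for an OR query to match.
shared = sorted(set(query_terms) & set(indexed_terms))
print(len(query_terms), query_terms[:7])
print(shared)
```

Under these assumptions the two values share the single-character grams, which is why a document containing alex can match a search for asp.net.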
I increased min_gram to 5 and added a search_analyzer:
"tag": {
  "include_in_parent": true,
  "type": "nested",
  "properties": {
    "name": {
      "type": "string",
      "analyzer": "domain_analyzer",
      "search_analyzer": "standard"
    }
  }
}
But the validation output is still unexpected:
# http://ubuntu:9200/tag-test/assets/_validate/query?explain&q=tag.name:microso
filtered(tag.name:micro tag.name:micros tag.name:microso tag.name:icros tag.name:icroso tag.name:croso)->cache(_type:assets)
Hmm... it still includes searches for icros, icroso, and croso.
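The terms in that output are exactly the n-grams of microso with lengths between 5 and 10, which suggests the query string is still being passed through the n-gram analyzer rather than the standard one. A small Python sketch (the `ngrams` helper is hypothetical, written only to reproduce the term list) shows this:

```python
def ngrams(token, min_gram, max_gram):
    """All substrings of token with length in [min_gram, max_gram], grouped by start offset."""
    return [
        token[start:start + length]
        for start in range(len(token))
        for length in range(min_gram, max_gram + 1)
        if start + length <= len(token)
    ]


# Same input and settings as the validated query: min_gram 5, max_gram 10.
terms = ngrams("microso", 5, 10)
print(terms)
```

Why the search_analyzer did not take effect is not visible from the output alone; one thing worth checking is whether the mapping shown was actually applied to the index being queried.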
Answer 0 (score: 1)
The nGram token filter splits tokens at the character level. If you only need to split the text into words, your whitespace tokenizer already does that.
With the elyzer tool you can inspect each step of the analysis process. With your analyzer, it produces this:
> elyzer --es localhost:9200 --index ngram --analyzer domain_analyzer --text "Microsoft ASP.NET"
TOKENIZER: whitespace
{1:Microsoft} {2:ASP.NET}
TOKEN_FILTER: lowercase
{1:microsoft} {2:asp.net}
TOKEN_FILTER: mynGram
{1:m,mi,mic,micr,micro,micros,microso,microsof,microsoft,i,ic,icr,icro,icros,icroso,icrosof,icrosoft,c,cr,cro,cros,croso,crosof,crosoft,r,ro,ros,roso,rosof,rosoft,o,os,oso,osof,osoft,s,so,sof,soft,o,of,oft,f,ft,t} {2:a,as,asp,asp.,asp.n,asp.ne,asp.net,s,sp,sp.,sp.n,sp.ne,sp.net,p,p.,p.n,p.ne,p.net,.,.n,.ne,.net,n,ne,net,e,et,t}
What you seem to want is something more like this:
TOKENIZER: whitespace
{1:Microsoft} {2:ASP.NET}
TOKEN_FILTER: lowercase
{1:microsoft} {2:asp.net}
This can be achieved by removing the mynGram token filter from the analyzer.
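As a rough illustration of the effect of that change, the following Python sketch simulates an analyzer made of only the whitespace tokenizer and the lowercase filter. The `analyze` and `matches` helpers are hypothetical simplifications for illustration; real Elasticsearch matching involves more than a set lookup.

```python
def analyze(text):
    """Simulate whitespace tokenizer + lowercase filter (no n-gram step)."""
    return [token.lower() for token in text.split()]


# Terms that would be indexed for the tag name from the question.
indexed_terms = set(analyze("Microsoft ASP.NET"))


def matches(query):
    """True if any analyzed query term equals an indexed term."""
    return any(term in indexed_terms for term in analyze(query))


print(sorted(indexed_terms))
print(matches("asp.net"), matches("alex"), matches("asp"))
```

In this simplified model asp.net matches and alex no longer does, but a partial term such as asp does not match either, so prefix or substring searches would need an explicit wildcard in the query.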