The nGram filter doesn't work the way I expected

Date: 2016-04-11 02:36:15

Tags: elasticsearch

I created an index with the following mapping:

curl -XPUT http://ubuntu:9200/ngram-test -d '{
    "settings": {
        "analysis": {
            "filter": {
                "mynGram": {
                    "type": "nGram",
                    "min_gram": 1,
                    "max_gram": 10,
                    "token_chars": [ "letter", "digit" ]
                }
            },
            "analyzer": {
                "domain_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "mynGram"]
                }
            }
        }
    },
    "mappings": {
        "assets": {
            "properties": {
                "domain": {
                    "type": "string",
                    "analyzer": "domain_analyzer"
                },
                "tag": {
                    "include_in_parent": true,
                    "type": "nested",
                    "properties": {
                        "name": {
                            "type": "string",
                            "analyzer": "domain_analyzer"
                        }
                    }
                }
            }
        }
    }
}'; echo

Then I added some documents:

curl http://ubuntu:9200/ngram-test/assets/ -d '{
  "domain": "www.example.com",
  "tag": [
    {
      "name": "IIS"
    },
    {
      "name": "Microsoft ASP.NET"
    }
  ]
}'; echo

But when I run the query through the validate API,

http://ubuntu:9200/ngram-test/_validate/query?q=tag.name:asp.net&explain

the query has been rewritten to this:

filtered(tag.name:a tag.name:as tag.name:asp tag.name:asp. tag.name:asp.n tag.name:asp.ne tag.name:asp.net tag.name:s tag.name:sp tag.name:sp. tag.name:sp.n tag.name:sp.ne tag.name:sp.net tag.name:p tag.name:p. tag.name:p.n tag.name:p.ne tag.name:p.net tag.name:. tag.name:.n tag.name:.ne tag.name:.net tag.name:n tag.name:ne tag.name:net tag.name:e tag.name:et tag.name:t)->cache(org.elasticsearch.index.search.nested.NonNestedDocsFilter@ad04e78f)

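This is completely unexpected. I expected asp.net to become a wildcard-style query such as *asp.net* or asp.net*, not tag.name:a.

It means that when I query for asp.net, things like alex also show up in the search results, which is completely wrong.

Am I missing something?

Edit

I increased min_gram to 5 and added a search_analyzer: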

        "tag": {
            "include_in_parent": true,
            "type": "nested",
            "properties": {
                "name": {
                    "type": "string",
                    "analyzer": "domain_analyzer",
                    "search_analyzer": "standard"
                }
            }
        }

But judging from the validate output, the result is still unexpected:

# http://ubuntu:9200/tag-test/assets/_validate/query?explain&q=tag.name:microso
filtered(tag.name:micro tag.name:micros tag.name:microso tag.name:icros tag.name:icroso tag.name:croso)->cache(_type:assets)

Hmm... it still includes searches for icros, icroso, and croso.

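1 answer:

Answer 0: (score: 1)

The nGram token filter splits tokens at the character level. If you only need to split the text into words, your whitespace tokenizer already does that.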
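
To make the character-level splitting concrete, here is a minimal Python sketch that mimics the behaviour visible in the validate output from the question. It is not Elasticsearch code: `char_ngrams` is a hypothetical helper written for illustration, and it assumes the filter simply emits every substring whose length lies between min_gram and max_gram.

```python
def char_ngrams(token, min_gram=1, max_gram=10):
    """Return every substring of `token` with a length from min_gram to max_gram."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(token):
                break  # the remaining sizes would run past the end of the token
            grams.append(token[start:start + size])
    return grams


asp = char_ngrams("asp.net")
alex = char_ngrams("alex")

# Grams that both tokens produce; any one of them is enough for an OR query to match.
shared = sorted(set(asp) & set(alex))

print(asp[:7])   # the grams that start at the first character
print(shared)    # grams common to "asp.net" and "alex"
```

With min_gram set to 1, "asp.net" and "alex" share single-character grams such as a, which is why a query for asp.net can also match alex. Calling `char_ngrams("microso", min_gram=5)` yields the same six terms as the second validate output in the question.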

Using the elyzer tool, you can inspect each step of the analysis process. With your analyzer, it produces this:

> elyzer --es localhost:9200 --index ngram --analyzer domain_analyzer --text "Microsoft ASP.NET"

TOKENIZER: whitespace
{1:Microsoft}   {2:ASP.NET} 
TOKEN_FILTER: lowercase
{1:microsoft}   {2:asp.net} 
TOKEN_FILTER: mynGram
{1:m,mi,mic,micr,micro,micros,microso,microsof,microsoft,i,ic,icr,icro,icros,icroso,icrosof,icrosoft,c,cr,cro,cros,croso,crosof,crosoft,r,ro,ros,roso,rosof,rosoft,o,os,oso,osof,osoft,s,so,sof,soft,o,of,oft,f,ft,t}   {2:a,as,asp,asp.,asp.n,asp.ne,asp.net,s,sp,sp.,sp.n,sp.ne,sp.net,p,p.,p.n,p.ne,p.net,.,.n,.ne,.net,n,ne,net,e,et,t}

Whereas you seem to want something more like this:

TOKENIZER: whitespace
{1:Microsoft}   {2:ASP.NET}
TOKEN_FILTER: lowercase
{1:microsoft}   {2:asp.net}

This can be achieved by removing the mynGram token filter from the analyzer.
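
As a rough sketch of that change, the snippet below builds the analysis settings from the question with the n-gram filter left out of the analyzer's filter list. It only prints the JSON body; the assumption here is that everything else in the index definition (tokenizer, lowercase filter, mappings) stays exactly as in the question.

```python
import json

# Same "domain_analyzer" as in the question, minus the "mynGram" token filter.
# Assumption: the mappings section of the original request is kept unchanged.
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "domain_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase"],
                }
            }
        }
    }
}

body = json.dumps(settings, indent=4)
print(body)
```

The printed body can be sent with the same kind of curl -XPUT request used in the question when creating the index.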