Elasticsearch analyzer to match "Java", "script" and "JavaScript"

Time: 2016-06-06 15:32:17

Tags: elasticsearch

Indexed values: Java, JavaScript, ClojureScript

_input_    | _output_
Java       | JavaScript, Java
JavaScript | JavaScript
script     | JavaScript, ClojureScript

The analyzer that has come closest to the desired results is shown below.

"analysis": {
    "filter": {
        "trigrams_filter": {
            "type": "edge_ngram",
            "min_gram": "3",
            "max_gram": "3"
        }
    },
    "analyzer": {
        "trigrams": {
            "filter": [
                "lowercase",
                "trigrams_filter"
            ],
            "type": "custom",
            "tokenizer": "standard"
        }
    }
}

But it is not accurate enough: "JavaScript" returns both "JavaScript" and "Java", and "script" returns nothing.

1 answer:

Answer 0: (score: 1)

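Your mapping has one major problem: you are using the edge_ngram filter to search for part of a word. The edge_ngram filter is meant for finding words that start with the query value. In your case you should use the nGram filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenfilter.html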
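The difference between the two filters can be approximated in plain Python. This is only a sketch of the token streams, not Elasticsearch's implementation; `edge_ngrams` and `ngrams` are hypothetical helper names, and the input is assumed to be a single, already lowercased token.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Prefixes of the token with lengths min_gram..max_gram
    (roughly what an edge_ngram filter emits)."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def ngrams(token, min_gram, max_gram):
    """Every substring with length min_gram..max_gram
    (roughly what an ngram filter emits)."""
    return [
        token[i:i + n]
        for n in range(min_gram, min(max_gram, len(token)) + 1)
        for i in range(len(token) - n + 1)
    ]


# With the question's settings (min_gram = max_gram = 3) only the
# first three letters of each word become a term.
print(edge_ngrams("javascript", 3, 3))  # ['jav']
print(edge_ngrams("script", 3, 3))      # ['scr']

# The ngram filter also emits inner substrings, so "script" is one of
# the terms produced for "javascript".
print("script" in ngrams("javascript", 2, 20))  # True
```

Under this model, the question's analyzer reduces both "java" and "javascript" to the single term "jav", and reduces the query "script" to "scr", which is not among the indexed terms. That is consistent with the behavior reported in the question.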

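Also, you should specify the trigrams analyzer only for indexing. For searching it is better to use the standard analyzer, because there is no point in passing the query string through the nGram filter: you would get back more results than you need.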
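Why the search analyzer matters can be sketched with a simplified model. It assumes each document is a single word and that a document matches when it shares at least one term with the query (the default behavior of a `match` query); `index_terms` and `search` are hypothetical names, not Elasticsearch APIs.

```python
def index_terms(text, min_gram=2, max_gram=20):
    """Lowercase the word and emit every substring of length min_gram..max_gram."""
    token = text.lower()
    return {
        token[i:i + n]
        for n in range(min_gram, min(max_gram, len(token)) + 1)
        for i in range(len(token) - n + 1)
    }


def search(query, docs, ngram_query=False):
    """Return the docs that share at least one term with the query.

    ngram_query=False models a standard search analyzer (one lowercased term);
    ngram_query=True models running the query through the n-gram analyzer too.
    """
    query_terms = index_terms(query) if ngram_query else {query.strip().lower()}
    return [doc for doc in docs if query_terms & index_terms(doc)]


docs = ["Java", "JavaScript", "ClojureScript"]

print(search("Java", docs))        # ['Java', 'JavaScript']
print(search("JavaScript", docs))  # ['JavaScript']
print(search("script", docs))      # ['JavaScript', 'ClojureScript']

# N-gramming the query as well makes short fragments such as "ja" or "sc"
# match, so every document comes back.
print(search("JavaScript", docs, ngram_query=True))
# ['Java', 'JavaScript', 'ClojureScript']
```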

The correct mapping:

POST /so
{
   "settings": {
      "number_of_shards": 1,
      "analysis": {
         "filter": {
            "trigrams_filter": {
               "type": "nGram",
               "min_gram": "2",
               "max_gram": "20"
            }
         },
         "analyzer": {
            "trigrams": {
               "filter": [
                  "lowercase",
                  "trigrams_filter"
               ],
               "type": "custom",
               "tokenizer": "standard"
            }
         }
      }
   },
   "mappings": {
       "so" :{
           "properties": {
               "text": {
                   "type": "string",
                    "analyzer": "trigrams",
                    "search_analyzer": "standard"
               }
           }
       }
   }
}

Documents:

POST /so/so/1
{
    "text" :"Java"
}
POST /so/so/2
{
    "text" :"JavaScript"
}
POST /so/so/3
{
    "text" :"ClojureScript"
}

When your query string is "Java", the response contains Java and JavaScript:

POST /so/so/_search
{
    "query": {"match": {
       "text": "Java"
    }}
}

Response:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 1,
      "hits": [
         {
            "_index": "so",
            "_type": "so",
            "_id": "1",
            "_score": 1,
            "_source": {
               "text": "Java"
            }
         },
         {
            "_index": "so",
            "_type": "so",
            "_id": "2",
            "_score": 1,
            "_source": {
               "text": "JavaScript"
            }
         }
      ]
   }
}

When your query string is "JavaScript", the response contains only JavaScript:

POST /so/so/_search
{
    "query": {"match": {
       "text": "JavaScript"
    }}
}

Response:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1.4054651,
      "hits": [
         {
            "_index": "so",
            "_type": "so",
            "_id": "2",
            "_score": 1.4054651,
            "_source": {
               "text": "JavaScript"
            }
         }
      ]
   }
}

When your query string is "script", the response contains JavaScript and ClojureScript:

POST /so/so/_search
{
    "query": {"match": {
       "text": "script"
    }}
}

Response:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 1,
      "hits": [
         {
            "_index": "so",
            "_type": "so",
            "_id": "2",
            "_score": 1,
            "_source": {
               "text": "JavaScript"
            }
         },
         {
            "_index": "so",
            "_type": "so",
            "_id": "3",
            "_score": 1,
            "_source": {
               "text": "ClojureScript"
            }
         }
      ]
   }
}