Prioritizing certain fields in ES search results

Time: 2019-10-17 04:15:32

Tags: elasticsearch

I am using elasticsearch-6.4.3. I created an index

flight-location_methods

with the Ruby excerpt below, which shows the settings and represents the mapping I created the index with:

settings index: {
  analysis: {
    "filter": {
      "autocomplete_filter": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 20
      }
    },
    "analyzer": {
      "autocomplete": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "autocomplete_filter"]
      }
    }
  }
}

mapping do
  indexes :airport_code, type: "text", analyzer: "autocomplete", search_analyzer: "standard"
  indexes :airport_name, type: "text", analyzer: "autocomplete", search_analyzer: "standard"
  indexes :city_name, type: "text", analyzer: "autocomplete", search_analyzer: "standard"
  indexes :country_name, type: "text", analyzer: "autocomplete", search_analyzer: "standard"
end

When I execute this query:

GET /flight-location_methods/_search
{
  "from": 0,
  "size": 1000,
  "query": {
    "function_score": {
      "functions": [
        {
          "filter": {
            "match": {
              "city_name": "new yo"
            }
          },
          "weight": 50
        },
        {
          "filter": {
            "match": {
              "country_name": "new yo"
            }
          },
          "weight": 50
        }
      ],
      "max_boost": 200,
      "score_mode": "max",
      "boost_mode": "multiply",
      "min_score": 10
    }
  }
}

I get this result (excerpt of the hits):

{ "_index": "flight-location_methods", "_type": "_doc", "_id": "tcoj1G0Bdo5Q9AduxCKi", "_score": 50,
  "_source": { "airport_name": "Ouvea", "airport_code": "UVE", "city_name": "Ouvea", "country_name": "New Caledonia" } },
{ "_index": "flight-location_methods", "_type": "_doc", "_id": "zMoj1G0Bdo5Q9AduxCKi", "_score": 50,
  "_source": { "airport_name": "Palmerston North", "airport_code": "PMR", "city_name": "Palmerston North", "country_name": "New Zealand" } },
{ "_index": "flight-location_methods", "_type": "_doc", "_id": "1Moj1G0Bdo5Q9AduxCKi", "_score": 50,
  "_source": { "airport_name": "Westport", "airport_code": "WSZ", "city_name": "Westport", "country_name": "New Zealand" } },
{ "_index": "flight-location_methods", "_type": "_doc", "_id": "1coj1G0Bdo5Q9AduxCKi", "_score": 50,
  "_source": { "airport_name": "Whangarei", "airport_code": "WRE", "city_name": "Whangarei", "country_name": "New Zealand" } },
{ "_index": "flight-location_methods", "_type": "_doc", "_id": "Rsoj1G0Bdo5Q9AduxCOi", "_score": 50,
  "_source": { "airport_name": "Municipal", "airport_code": "RNH", "city_name": "New Richmond", "country_name": "United States" } },
{ "_index": "flight-location_methods", "_type": "_doc", "_id": "fsoj1G0Bdo5Q9AduxCOi", "_score": 50,
  "_source": { "airport_name": "New London", "airport_code": "GON", "city_name": "New London", "country_name": "United States" } },
{ "_index": "flight-location_methods", "_type": "_doc", "_id": "gMoj1G0Bdo5Q9AduxCOi", "_score": 50,
  "_source": { "airport_name": "New Ulm", "airport_code": "ULM", "city_name": "New Ulm", "country_name": "United States" } },
{ "_index": "flight-location_methods", "_type": "_doc", "_id": "5coj1G0Bdo5Q9AduxCSi", "_score": 50,
  "_source": { "airport_name": "Cape Newenham", "airport_code": "EHM", "city_name": "Cape Newenham", "country_name": "United States" } },
{ "_index": "flight-location_methods", "_type": "_doc", "_id": "Ycoj1G0Bdo5Q9AduxCWi", "_score": 50,
  "_source": { "airport_name": "East 60th Street H/P", "airport_code": "JRE", "city_name": "New York", "country_name": "United States" } }

You can see that every hit scores 50, and New York is not at the top.

New York should be on top: if the search text contains multiple words, I want any of the words to match in any field, but if all of the search words match within a single field, that document should get higher priority.

1 Answer:

Answer 0 (score: 2)

Let's first discuss the Elasticsearch tokenizer and the tokenization process:

  A tokenizer receives a stream of characters and breaks it up into individual tokens (usually individual words). — ES docs

Now let's describe how the autocomplete analyzer works:

  1. The standard tokenizer breaks the character stream into tokens (for simplicity, let's say words); this is Elasticsearch's stock standard tokenizer.
  2. The lowercase filter lowercases every character.
  3. The edge_ngram filter then breaks each word into tokens.
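To make the pipeline concrete, here is a rough Ruby simulation of that analyzer. This is not Lucene's actual implementation — the regex only approximates the standard tokenizer, and `autocomplete_analyze` is a name made up for this sketch:

```ruby
# Rough simulation of the custom "autocomplete" analyzer:
# 1. tokenize (split on word characters, approximating "standard"),
# 2. lowercase each token,
# 3. emit edge n-grams of each token.
def autocomplete_analyze(text, min_gram: 1, max_gram: 20)
  text.scan(/\w+/).map(&:downcase).flat_map do |token|
    (min_gram..[max_gram, token.length].min).map { |n| token[0, n] }
  end
end

p autocomplete_analyze("New York")
# => ["n", "ne", "new", "y", "yo", "yor", "york"]
p autocomplete_analyze("New York", min_gram: 2, max_gram: 5)
# => ["ne", "new", "yo", "yor", "york"]
```

So the indexed terms for a city named "New York" include every prefix of each word, which is what makes a prefix search like "new yo" possible.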

Here is where the magic begins: I think your gram range of 1 to 20 is too broad. There may be words longer than 10 characters, but they are irrelevant for autocomplete. Likewise, tokens consisting of a single character are not useful to us. I changed it to:

   "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 5
        }
      }

Now our index will contain many word fragments of 2 to 5 characters. Since we know what we are searching for, we can create the mapping and write the query:

{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 5
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "airport_name": {
          "type": "text",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "autocomplete"
            }
          }
        },
        "airport_code": {
          "type": "keyword",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "autocomplete"
            }
          }
        },
        "city_name": {
          "type": "keyword",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "autocomplete"
            }
          }
        },
        "country_name": {
          "type": "keyword",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "autocomplete"
            }
          }
        }
      }
    }
  }
}

I created each field as a regular field with an ngram sub-field to keep the ability to run aggregations. For example, finding cities served by multiple airports is a nice use case.
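Since city_name itself is now a keyword field, it can be bucketed directly. As a sketch (field names per the mapping above; the bucket name and min_doc_count are chosen for illustration), an aggregation that finds cities with more than one airport might look like:

```json
GET /flight-location_methods/_search
{
  "size": 0,
  "aggs": {
    "cities_with_multiple_airports": {
      "terms": {
        "field": "city_name",
        "min_doc_count": 2
      }
    }
  }
}
```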

Now we can run a simple query to get New York:

{
   "size": 20, 
   "query": {
     "query_string": {
       "default_field": "city_name.ngram",
       "query": "new yo",
       "default_operator": "AND"
     }
   }
}
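To see why default_operator "AND" narrows the hits down to New York, here is a simplified Ruby sketch. It compares whole lowercased query tokens against the indexed n-grams; the real query_string also runs the search analyzer over the input, but the outcome for this input is the same. The variable and method names are illustrative:

```ruby
# Edge n-grams (min_gram 2, max_gram 5) indexed for two city_name values
new_york     = %w[ne new yo yor york]
new_richmond = %w[ne new ri ric rich richm]

# default_operator "AND": every query token must be found in the field
def matches_all?(query, ngrams)
  query.downcase.split.all? { |token| ngrams.include?(token) }
end

p matches_all?("new yo", new_york)      # => true
p matches_all?("new yo", new_richmond)  # => false ("yo" is not indexed)
```

With OR semantics both documents would match on "new"; AND is what drops New Richmond (and the other "New ..." hits) from the result list.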

The response:
{
  "took": 15,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 13.896059,
    "hits": [
      {
        "_index": "test-index",
        "_type": "_doc",
        "_id": "BtBD2W0BCDulLSY6pKM8",
        "_score": 13.896059,
        "_source": {
          "airport_name": "Flushing",
          "airport_code": "FLU",
          "city_name": "New York",
          "country_name": "United States"
        }
      }
    ]
  }
}

Or build a boosting full-text query using the boost functionality. It will also be more efficient for queries against large data sets.

Your query should be:

{
   "query": {
     "function_score": {
       "query": {
         "query_string": {
           "query": "new yo",
           "analyzer": "autocomplete"
         }
       },
       "functions": [
         {
           "filter": {"terms": {
             "city_name.ngram": [
               "new",
               "yo"
             ]
           }},
           "weight": 2
         },
         {
           "filter": {"terms": {
             "country_name.ngram": [
               "new",
               "yo"
             ]
           }},
           "weight": 2
         }
       ],
       "max_boost": 30,
       "min_score": 5, 
       "score_mode": "max",
       "boost_mode": "multiply"
     }
   }
}

In this query New York will come first, because the query part filters out all irrelevant documents. The score is then multiplied by 2 for the city_name.ngram filter; since both tokens match in that field, it gets the highest score. The min_score at the bottom of the query filters out non-relevant documents. You can read about the current Elasticsearch relevance algorithm here. By the way, I would not put filters with the same weight into the functions; you should decide which field is more important. That makes your search clearer.
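The scoring arithmetic above can be sketched as follows. The weights, min_score, and max_boost mirror the query; the base scores passed in are made-up values, and the method name is invented for this sketch:

```ruby
# function_score arithmetic for the query above:
# - each "terms" filter that matches contributes its weight (2),
# - score_mode "max" keeps the largest matching weight (capped at max_boost),
# - boost_mode "multiply" multiplies it into the base query score,
# - min_score drops hits that end up below the threshold.
def function_score(base_score, city_matches:, country_matches:,
                   min_score: 5, max_boost: 30)
  weights = []
  weights << 2 if city_matches     # terms filter on city_name.ngram
  weights << 2 if country_matches  # terms filter on country_name.ngram
  boost = [weights.max || 1, max_boost].min
  score = base_score * boost
  score >= min_score ? score : nil
end

p function_score(13.9, city_matches: true,  country_matches: false)  # => 27.8
p function_score(2.0,  city_matches: false, country_matches: false)  # => nil (below min_score)
```

A document whose city_name matches both tokens gets its full-text score doubled, while weak matches fall under min_score and disappear from the results entirely.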