nGram partial matching & limiting nGram results in a multiple-field query

Asked: 2016-10-07 19:46:21

Tags: search elasticsearch indexing n-gram elasticsearch-2.0

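Background: I have implemented partial search on a name field by indexing the tokenized name (the name field) as well as a trigram-analyzed name (the ngram field).

I have boosted the name field so that exact token matches bubble up to the top of the hits.

Problem: I am trying to implement a query that limits the nGram matches to those that match at least some threshold (say 80%) of the query string. I understand that minimum_should_match seems to be what I am looking for, but my problem is forming the query so that it actually produces those results.

My exact token matches are boosted to the top, but I still get every document that has even a single matching trigram in the ngram field.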

GIST: Index settings and mapping

Index settings

{
  "my_index": {
    "settings": {
      "index": {
        "number_of_shards": "5",
        "max_result_window": "30000",
        "creation_date": "1475853851937",
        "analysis": {
          "filter": {
            "ngram_filter": {
              "type": "ngram",
              "min_gram": "3",
              "max_gram": "3"
            }
          },
          "analyzer": {
            "ngram_analyzer": {
              "filter": [
                "lowercase",
                "ngram_filter"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "AuCjcP5sSb-m59bYrprFcw",
        "version": {
          "created": "2030599"
        }
      }
    }
  }
}

Index mapping

{
  "my_index": {
    "mappings": {
      "my_type": {
        "properties": {
          "acw": {
            "type": "integer"
          },
          "pcg": {
            "type": "integer"
          },
          "date": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "dob": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "id": {
            "type": "string"
          },
          "name": {
            "type": "string",
            "boost": 10
          },
          "ngram": {
            "type": "string",
            "analyzer": "ngram_analyzer"
          },
          "bdk": {
            "type": "integer"
          },
          "mmw": {
            "type": "integer"
          },
          "mpi": {
            "type": "integer"
          },
          "sex": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}

Solution attempts

[GIST: query attempts], unlinked because of the 2-link limit :(

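I tried a multi_match query, which gives me the correct search results, but I have had no luck omitting results for names that match only a single trigram (for example, a long name that matches only on the trigram "odo").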

(https://gist.github.com/jordancardwell/2e690013666e7e1da6ef1acee314b4e6)

//this matches 'frodo' and sends results to the top, since `name` field is boosted
//  but also matches 'theodore' and 'rodolpho'

{
  "size":100,
  "from":0,
  "query":{
    "multi_match":{
      "query":"frodo",
      "fields":[
        "name",
        "ngram"
      ],
      "type":"best_fields"
    }
  }
}

// I then tried to throw in the `minimum_should_match` option,
// hoping it would filter out large strings that only had one matching trigram, for instance

{
  "size":100,
  "from":0,
  "query":{
    "multi_match":{
      "query":"frodo",
      "fields":[
        "name",
        "ngram"
      ],
      "type":"best_fields",
      "minimum_should_match": "90%"
    }
  }
}

I have tried playing around with manually building match queries so that minimum_should_match applies only to the ngram field, but I can't seem to get the syntax right.

// I then tried to construct a custom query to return only the `minimum_should_match`ed results on the ngram field
// I started with a query produced by using bodybuilder to `and` and `or` my other search criteria together

{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            // each separate field's criteria `must`/`and`ed together
            {
              "query": {
                "bool": {
                  "filter": {
                    "bool": {
                      "should": [
                        // each criterion for a specific field `should`/`or`ed together
                        {
                          // my attempt at getting `ngram` field results..
                          // should theoretically only return when field
                          // contains nothing but matching ngrams
                          // (i.e. exact matches and other fluke matches)
                          "query": {
                            "match": {
                              "ngram": {
                                "query": "frodo",
                                "minimum_should_match": "100%"
                              }
                            }
                          }
                        }
                        //... other criteria to be `should`/`or`ed together
                      ]
                    }
                  }
                }
              }
            }
            //... other criteria to be `must`/`and`ed together
          ]
        }
      }
    }
  }
}

Can anyone see what I am doing wrong?

It seems like this should be fairly simple, but I must be missing something obvious.

Update

I ran a query with _explain=true (using the Sense UI) to try to understand my results.

I queried for a match on the ngram field for "frod" with minimum_should_match = 100%, but I still get every record that matches at least one ngram (e.g. rodolpho, even though it does not contain "fro").

GIST: test query and results

Note: cross-posted from [discuss.elastic.co]. I will post a link later; I can't post more than 2 yet :/

(https://discuss.elastic.co/t/ngram-partial-match-limiting-ngram-results-in-multiple-field-query/62526)

1 answer:

Answer 0 (score: 1)

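I used your settings and mapping to create an index, and your queries seem to work fine for me. I suggest running an explain on one of the "unexpected" documents that is being returned, to see why it matches and comes back alongside the other results.

Here is what I did:

Run the analyze API against your analyzer to see how the query is split into tokens.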

curl -XGET 'localhost:9200/my_index/_analyze' -d '
{
  "analyzer" : "ngram_analyzer",
  "text" : "frodo"
}'

Your analyzer splits frodo into 3 tokens.

{
  "tokens": [
    {
      "token": "fro",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "rod",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "odo",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    }
  ]
}
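The token list above can be approximated offline. The following is a minimal Python sketch, not the analyzer itself: the function name ngram_analyze is hypothetical, and splitting on non-alphanumeric characters is an assumption that only roughly stands in for the standard tokenizer on plain ASCII names.

```python
import re

def ngram_analyze(text, n=3):
    """Approximate ngram_analyzer: split into words, lowercase, emit n-grams per word."""
    # Assumption: a simple regex split stands in for the standard tokenizer.
    words = re.findall(r"[A-Za-z0-9]+", text)
    grams = []
    for word in words:
        word = word.lower()
        # In this sketch, words shorter than n characters yield no grams.
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(ngram_analyze("frodo"))  # ['fro', 'rod', 'odo']
```

Each gram becomes one optional clause in a match query on the ngram field, which is why the number of grams matters when a percentage is applied.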

I indexed 3 documents for testing (using only the ngram field). Here are the documents:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "ngram": "theodore"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "ngram": "frodo"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "ngram": "rudolpho"
        }
      }
    ]
  }
}

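Your first query matches frodo and theodore, but it does not match rudolpho as you described. That makes sense, because rudolpho does not produce any trigram that matches one of frodo's trigrams.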

frodo -> fro, rod, odo 

rudolpho -> rud, udo, dol, olp, lph, pho
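To check the overlap for other names, here is a small Python sketch that compares trigram sets directly. The helper names trigrams and shared_trigrams are illustrative only, and the sketch ignores scoring and tokenization details.

```python
def trigrams(word):
    """Set of 3-character substrings of a lowercased word."""
    word = word.lower()
    return {word[i:i + 3] for i in range(len(word) - 2)}

def shared_trigrams(query, name):
    """Sorted trigrams that the query and the name have in common."""
    return sorted(trigrams(query) & trigrams(name))

for name in ["frodo", "theodore", "rudolpho", "rodolpho"]:
    print(name, shared_trigrams("frodo", name))
# frodo ['fro', 'odo', 'rod']
# theodore ['odo']
# rudolpho []
# rodolpho ['odo', 'rod']
```

Note that the question spells the name rodolpho, which by this calculation shares rod and odo with frodo, whereas the rudolpho indexed for this test shares none.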

With your second query, I get back only frodo (neither of the other two).

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.53148466,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.53148466,
        "_source": {
          "ngram": "frodo"
        }
      }
    ]
  }
}

I then ran an explain on the other two documents, theodore and rudolpho (localhost:9200/my_index/my_type/2/_explain), and saw this (I have trimmed the response):

{
  "_index": "my_index",
  "_type": "my_type",
  "_id": "2",
  "matched": false,
  "explanation": {
    "value": 0,
    "description": "Failure to meet condition(s) of required/prohibited clause(s)",
    "details": [
      {
        "value": 0,
        "description": "no match on required clause ((ngram:fro ngram:rod ngram:odo)~2)",
        "details": [

The above is expected, because at least two of the three tokens from frodo must match.
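The ~2 in the explain output is consistent with a 90% threshold over three optional clauses being rounded down, which is how the Elasticsearch documentation describes percentage values of minimum_should_match. Here is a minimal Python sketch of that arithmetic. The function name required_matches is hypothetical, and the shared-trigram counts are taken from the trigram lists for the three test documents.

```python
def required_matches(num_clauses, percent):
    """Optional clauses that must match for a positive percentage (rounded down)."""
    return num_clauses * percent // 100

# Trigrams each test document shares with the query "frodo" (fro, rod, odo).
shared_with_frodo = {"frodo": 3, "theodore": 1, "rudolpho": 0}

needed = required_matches(3, 90)  # 2.7 rounds down to 2
for name, shared in shared_with_frodo.items():
    print(name, "matches" if shared >= needed else "no match")
# frodo matches
# theodore no match
# rudolpho no match
```

Under the same assumption, 100% of three clauses requires all three trigrams, and 80% again requires two.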