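Elasticsearch does not return exact matches first

Time: 2013-05-16 15:56:25

Tags: elasticsearch

I have an Elasticsearch index with a field meant for exact matching. Somehow I get a lot of similar results (which I don't mind), and those similar results are sorted before the exact match (which I do mind).

Can someone explain what is happening and how to fix it?

My mapping looks like this: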

"exact":{
  "type":"string",
  "boost":10.0,
  "analyzer":"keyword"
},

My query, searching for "AAPL P JAN 2014 885,00", looks like this:

{
  "size" : 21,
  "query" : {
    "field" : {
      "exact" : "AAPL P JAN 2014 885,00"
    }
  },
  "explain" : true,
  "sort" : [ {
    "_score" : {
      "order" : "desc"
    }
  } ],
  "facets" : {
    "category" : {
      "terms" : {
        "field" : "category",
        "size" : 10
      }
    }
  }
}

The returned documents come back in this order:

  • {"exact":["APPLE INC","US0378331005","AAPL","73773"],"id-compound":"AAPL"}
  • {"exact":["AAPL","73773","AAPL P JAN 2014 675,00"],"id-compound":"AAPL*PUT*20140118*675"}
  • {"exact":["AAPL","73773","AAPL C JAN 2014 500,00"],"id-compound":"AAPL*CALL*20140118*500"}

and so on, with the exact match a number of results further down.

Can someone explain to me why the exact match does not end up on top?

In case it helps, the search result with the full explain output is below.

"hits" : [ {
  "_shard" : 0,
  "_node" : "1",
  "_index" : "instruments",
  "_type" : "instrument",
  "_id" : "AAPL",
  "_score" : 1306.8339, "_source" : {"exact":["APPLE INC","US0378331005","AAPL","73773"],"id-compound":"AAPL"},
  "_explanation" : {
    "value" : 1306.8339,
    "description" : "product of:",
    "details" : [ {
      "value" : 6534.169,
      "description" : "sum of:",
      "details" : [ {
        "value" : 6534.169,
        "description" : "weight(exact:AAPL in 9096), product of:",
        "details" : [ {
          "value" : 0.25854474,
          "description" : "queryWeight(exact:AAPL), product of:",
          "details" : [ {
            "value" : 6.1701355,
            "description" : "idf(docFreq=211, maxDocs=37299)"
          }, {
            "value" : 0.0419026,
            "description" : "queryNorm"
          } ]
        }, {
          "value" : 25272.875,
          "description" : "fieldWeight(exact:AAPL in 9096), product of:",
          "details" : [ {
            "value" : 1.0,
            "description" : "tf(termFreq(exact:AAPL)=1)"
          }, {
            "value" : 6.1701355,
            "description" : "idf(docFreq=211, maxDocs=37299)"
          }, {
            "value" : 4096.0,
            "description" : "fieldNorm(field=exact, doc=9096)"
          } ]
        } ]
      } ]
    }, {
      "value" : 0.2,
      "description" : "coord(1/5)"
    } ]
  }
}, {
  "_shard" : 0,
  "_node" : "1",
  "_index" : "instruments",
  "_type" : "instrument",
  "_id" : "AAPL*PUT*20140118*675",
  "_score" : 163.35423, "_source" : {"exact":["AAPL","73773","AAPL P JAN 2014 675,00"],"id-compound":"AAPL*PUT*20140118*675"},
  "_explanation" : {
    "value" : 163.35423,
    "description" : "product of:",
    "details" : [ {
      "value" : 816.7711,
      "description" : "sum of:",
      "details" : [ {
        "value" : 816.7711,
        "description" : "weight(exact:AAPL in 18), product of:",
        "details" : [ {
          "value" : 0.25854474,
          "description" : "queryWeight(exact:AAPL), product of:",
          "details" : [ {
            "value" : 6.1701355,
            "description" : "idf(docFreq=211, maxDocs=37299)"
          }, {
            "value" : 0.0419026,
            "description" : "queryNorm"
          } ]
        }, {
          "value" : 3159.1094,
          "description" : "fieldWeight(exact:AAPL in 18), product of:",
          "details" : [ {
            "value" : 1.0,
            "description" : "tf(termFreq(exact:AAPL)=1)"
          }, {
            "value" : 6.1701355,
            "description" : "idf(docFreq=211, maxDocs=37299)"
          }, {
            "value" : 512.0,
            "description" : "fieldNorm(field=exact, doc=18)"
          } ]
        } ]
      } ]
    }, {
      "value" : 0.2,
      "description" : "coord(1/5)"
    } ]
  }
}, {
  "_shard" : 0,
  "_node" : "1",
  "_index" : "instruments",
  "_type" : "instrument",
  "_id" : "AAPL*CALL*20140118*500",
  "_score" : 163.35423, "_source" : {"exact":["AAPL","73773","AAPL C JAN 2014 500,00"],"id-compound":"AAPL*CALL*20140118*500"},
  "_explanation" : {
    "value" : 163.35423,
    "description" : "product of:",
    "details" : [ {
      "value" : 816.7711,
      "description" : "sum of:",
      "details" : [ {
        "value" : 816.7711,
        "description" : "weight(exact:AAPL in 383), product of:",
        "details" : [ {
          "value" : 0.25854474,
          "description" : "queryWeight(exact:AAPL), product of:",
          "details" : [ {
            "value" : 6.1701355,
            "description" : "idf(docFreq=211, maxDocs=37299)"
          }, {
            "value" : 0.0419026,
            "description" : "queryNorm"
          } ]
        }, {
          "value" : 3159.1094,
          "description" : "fieldWeight(exact:AAPL in 383), product of:",
          "details" : [ {
            "value" : 1.0,
            "description" : "tf(termFreq(exact:AAPL)=1)"
          }, {
            "value" : 6.1701355,
            "description" : "idf(docFreq=211, maxDocs=37299)"
          }, {
            "value" : 512.0,
            "description" : "fieldNorm(field=exact, doc=383)"
          } ]
        } ]
      } ]
    }, {
      "value" : 0.2,
      "description" : "coord(1/5)"
    } ]
  }
}, {
  "_id" : "AAPL*PUT*20140118*940",
  "_score" : 163.35423, "_source" : {"exact":["AAPL","73773","AAPL P JAN 2014 940,00"],"id-compound":"AAPL*PUT*20140118*940"},
  "_explanation" : {
    "value" : 163.35423,
    "description" : "product of:",
    "details" : [ {
      "value" : 816.7711,
      "description" : "sum of:",
      "details" : [ {
        "value" : 816.7711,
        "description" : "weight(exact:AAPL in 794), product of:",
        "details" : [ {
          "value" : 0.25854474,
          "description" : "queryWeight(exact:AAPL), product of:",
          "details" : [ {
            "value" : 6.1701355,
            "description" : "idf(docFreq=211, maxDocs=37299)"
          }, {
            "value" : 0.0419026,
            "description" : "queryNorm"
          } ]
        }, {
          "value" : 3159.1094,
          "description" : "fieldWeight(exact:AAPL in 794), product of:",
          "details" : [ {
            "value" : 1.0,
            "description" : "tf(termFreq(exact:AAPL)=1)"
          }, {
            "value" : 6.1701355,
            "description" : "idf(docFreq=211, maxDocs=37299)"
          }, {
            "value" : 512.0,
            "description" : "fieldNorm(field=exact, doc=794)"
          } ]
        } ]
      } ]
    }, {
      "value" : 0.2,
      "description" : "coord(1/5)"
    } ]
  }
}

This is what happens when I analyze the data I want to store:

curl -XGET 'localhost:9200/instruments/_analyze?field=exact&pretty=true' -d 'ING  P JUN 2013 6.00'
{
  "tokens" : [ {
    "token" : "ING  P JUN 2013 6.00",
    "start_offset" : 0,
    "end_offset" : 20,
    "type" : "word",
    "position" : 1
  } ]
}

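5 answers:

Answer 0 (score: 2)

I'm not sure it is technically the best approach, but if you are only after one specific result from Elasticsearch, you can use a filter with a script that checks for the exact match.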

{
  "from" : 0,
  "size" : 1,
  "query" : { 
    "text_phrase" : { 
      "title" : "AAPL P JAN 2014 885,00"
    } 
  },
  "filter" : { 
    "script" : { 
      "script" : "_source.exact.contains(x)", 
      "params" : { 
        "x" : "AAPL P JAN 2014 885,00" 
      }  
    } 
  }
}

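I use this to fetch a known entry from Elasticsearch, and it works well for me.

Answer 1 (score: 1)

I think you have already found the answer; I just want to add some more information for others who run into the same problem.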

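You are using a field query. From the Elasticsearch documentation:

Field query:

A query that executes a query string against a specific field. It is a simplified version of the query_string query (by setting the default_field to the field this query executes against).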

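I believe the query_string query is meant for free text, i.e. it does a lot of work on the query to make it fuzzy and so on.

What you want to use (and I think you have figured this out) is a term query, which does nothing to the search phrase and therefore gives you only exact matches.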
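As a rough illustration of the difference, the two request bodies can be built side by side. This is only a sketch in Python: the helper names field_query_body and term_query_body are made up for this example, and the bodies follow the query shapes shown in this thread rather than any particular client library.

```python
import json


def field_query_body(field, text):
    """Body for the `field` query used in the question (text goes through the query parser)."""
    return {"query": {"field": {field: text}}}


def term_query_body(field, text):
    """Body for a `term` query: the text is looked up as a single, unanalyzed term."""
    return {"query": {"term": {field: text}}}


# Print the term query for the search string from the question.
print(json.dumps(term_query_body("exact", "AAPL P JAN 2014 885,00"), indent=2))
```

With the term query, the whole string "AAPL P JAN 2014 885,00" is compared against the indexed tokens as one unit, which is what an exact-match lookup needs when the field is indexed with the keyword analyzer.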

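Note: analysis happens at two different times, index time and query time. Setting "analyzer": "keyword" seems to affect the search-time query only "when using a query string" (wording from the Elasticsearch docs). I have to admit I don't know exactly what that means (I guess it refers to query_string, but it could also mean searches like http://../_search?q=exact:{query here}).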

Answer 2 (score: 1)

You should not analyze your id field.

Define your field as:

"exact":{
   "type":"string",
   "index":"not_analyzed"
 }

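See Finding Exact Values.

Answer 3 (score: 0)

All of the returned documents match on the same single term, "AAPL", as you can see from the explain output. The term occurs once in each document (tf = 1) and appears in 211 of 37299 documents (idf = 6.1701355), so the scores differ only through the field norm (4096 for the first hit, 512 for the others). Because you use index-time boosting (the boost part of your mapping, 10), the field norms are much higher than usual. That is not a big deal, since the matches are always on the same field. It only means that a match on this field will almost always win over a match on other fields, which may well make sense in your case.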

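The real problem is that, looking at your documents, none of them is an exact match for AAPL P JAN 2014 885,00. What I see is that only one of the 5 terms in your query matches, which is confirmed by coord in your explain output: coord(1/5).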
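To make the explain output concrete, the score can be recomputed by hand from its parts. The following is a minimal Python sketch, not Elasticsearch code: the function name explain_score is made up for illustration, and the numbers are copied from the explain output in the question.

```python
def explain_score(idf, query_norm, tf, field_norm, coord):
    """Recombine the factors listed in the explain output into a final score."""
    query_weight = idf * query_norm        # queryWeight(exact:AAPL)
    field_weight = tf * idf * field_norm   # fieldWeight(exact:AAPL in <doc>)
    return query_weight * field_weight * coord


IDF = 6.1701355         # idf(docFreq=211, maxDocs=37299)
QUERY_NORM = 0.0419026  # queryNorm
COORD = 1 / 5           # coord(1/5): one of five query terms matched

first_hit = explain_score(IDF, QUERY_NORM, 1.0, 4096.0, COORD)
other_hits = explain_score(IDF, QUERY_NORM, 1.0, 512.0, COORD)
print(first_hit, other_hits)
```

The two results land close to the reported _score values of 1306.8339 and 163.35423. The only factor that differs between the hits is fieldNorm, and every hit carries the same coord penalty of 1/5 because only "AAPL" matched.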

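The keyword analyzer does seem to be applied, but as you can see from the returned documents, you are not sending the content of the exact field as a single value but as an array of values. Since you use the keyword analyzer, each item is not tokenized, but you still end up with multiple tokens. I guess you have to check how you are indexing your documents.

Answer 4 (score: 0)

The reason the keyword analyzer seems to be ignored in the search query is that ES tokenizes the string twice: first it runs its DSL tokenizer, and then it runs the tokenizer specified in the mapping on the result. This is explained in more detail in this article: http://paulsabou.com/blog/2012/03/25/advanced-exact-matching-with-elastic-search/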
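A toy model can show where coord(1/5) comes from. The Python sketch below is not how Elasticsearch is implemented. It assumes that the query parser splits the query text on whitespace before the keyword analyzer sees each piece, and that each array item in the exact field is indexed as one untouched token. All function names are made up for illustration.

```python
def query_terms(query_text):
    """Assumed parser behavior: split on whitespace, then keep each piece as-is (keyword analyzer)."""
    return query_text.split()


def indexed_tokens(values):
    """Keyword analyzer at index time: every array item becomes exactly one token."""
    return set(values)


def coord(query_text, values):
    """Return (matched terms, total query terms), like coord(matched/total) in explain."""
    terms = query_terms(query_text)
    tokens = indexed_tokens(values)
    matched = sum(1 for term in terms if term in tokens)
    return matched, len(terms)


def term_lookup(query_text, values):
    """What a term-style lookup does: compare the whole string against the tokens."""
    return query_text in indexed_tokens(values)


doc = ["AAPL", "73773", "AAPL P JAN 2014 675,00"]
print(coord("AAPL P JAN 2014 885,00", doc))        # only "AAPL" is found among the tokens
print(term_lookup("AAPL P JAN 2014 885,00", doc))  # the full string is not one of the tokens
```

Under these assumptions the query becomes five separate terms, of which only "AAPL" exists as a token in the document, so any document containing "AAPL" matches equally well on that one term.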