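Background: I've implemented partial search on a name field by indexing both the tokenized name (the `name` field) and a trigram-analyzed copy of the name (the `ngram` field).
I've boosted the `name` field so that exact token matches bubble to the top of the hits.
Problem: I'm trying to write a query that limits nGram matches to those that match at least some threshold (say 80%) of the query string. I understand that `minimum_should_match` seems to be what I'm looking for, but my problem is forming a query that actually produces those results.
My exact token matches are boosted to the top, but I still get every document that has even a single matching trigram in the `ngram` field.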
GIST: Index settings and mapping
Index settings
{
"my_index": {
"settings": {
"index": {
"number_of_shards": "5",
"max_result_window": "30000",
"creation_date": "1475853851937",
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": "3",
"max_gram": "3"
}
},
"analyzer": {
"ngram_analyzer": {
"filter": [
"lowercase",
"ngram_filter"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "AuCjcP5sSb-m59bYrprFcw",
"version": {
"created": "2030599"
}
}
}
}
}
Index mapping
{
"my_index": {
"mappings": {
"my_type": {
"properties": {
"acw": {
"type": "integer"
},
"pcg": {
"type": "integer"
},
"date": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"dob": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"id": {
"type": "string"
},
"name": {
"type": "string",
"boost": 10
},
"ngram": {
"type": "string",
"analyzer": "ngram_analyzer"
},
"bdk": {
"type": "integer"
},
"mmw": {
"type": "integer"
},
"mpi": {
"type": "integer"
},
"sex": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
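Solution attempts
GIST: Query attempts (unlinked because of the 2-link limit): https://gist.github.com/jordancardwell/2e690013666e7e1da6ef1acee314b4e6
I tried a multi-match query, which gives me the right search results, but I've had no luck excluding names that match only a single trigram (e.g. the trigram "odo" inside "odo philus").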
//this matches 'frodo' and sends results to the top, since `name` field is boosted
// but also matches 'theodore' and 'rodolpho'
{
"size":100,
"from":0,
"query":{
"multi_match":{
"query":"frodo",
"fields":[
"name",
"ngram"
],
"type":"best_fields"
}
}
}
//I then tried to throw in the `minimum_should_match` option
// hoping it would filter out large strings that only had one matching trigram for instance
{
"size":100,
"from":0,
"query":{
"multi_match":{
"query":"frodo",
"fields":[
"name",
"ngram"
],
"type":"best_fields",
"minimum_should_match": "90%"
}
}
}
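I've tried playing around with this, manually building match queries so that `minimum_should_match` applies only to the `ngram` field, but I can't seem to get the syntax right.
Can anyone see what I'm doing wrong?
It seems like this should be fairly simple, but I must be missing something obvious.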
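For reference, one way to scope the threshold to a single field is to give each field its own `match` clause inside a `bool`/`should`, and set `minimum_should_match` only on the `ngram` clause. The following is a minimal sketch that just builds the request body as a Python dict. The helper name `build_query` and the 80% default are illustrative assumptions, and the result has not been run against this index.

```python
import json

def build_query(text, ngram_threshold="80%"):
    """Build a request body with a per-field threshold (hypothetical helper)."""
    return {
        "size": 100,
        "from": 0,
        "query": {
            "bool": {
                "should": [
                    # Plain token match on the boosted `name` field.
                    {"match": {"name": {"query": text}}},
                    # Trigram match; the threshold applies to this clause only.
                    {"match": {"ngram": {
                        "query": text,
                        "minimum_should_match": ngram_threshold,
                    }}},
                ]
            }
        },
    }

print(json.dumps(build_query("frodo"), indent=2))
```

The printed JSON can be pasted into Sense or sent with curl like the other queries here.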
// I then tried to construct a custom query to just return the `minimum_should_match`d results on the ngram field
// I started with a query produced by using bodybuilder to `and` and `or` my other search criteria together
{
"query": {
"bool": {
"filter": {
"bool": {
"must": [
//each separate field's criteria `must`/`and`ed together
{
"query": {
"bool": {
"filter": {
"bool": {
"should": [
//each criterion for a specific field `should`/`or`ed together
{
//my attempt at getting `ngram` field results..
// should theoretically only return when field
// contains nothing but matching ngrams
// (i.e. exact matches and other fluke matches)
"query": {
"match": {
"ngram": {
"query": "frodo",
"minimum_should_match": "100%"
}
}
}
}
//... other criteria to be `should`/`or`ed together
]
}
}
}
}
}
//... other criteria to be `must`/`and`ed together
]
}
}
}
}
}
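I'm using `_explain=true` (via the Sense UI) when running queries to try to understand my results.
I ran a `match` query on the `ngram` field for "frod" with `minimum_should_match` = 100%, but I still get every record that matches at least one ngram
(e.g. `rodolpho`, even though it doesn't contain `fro`).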
Note: cross-posted from discuss.elastic.co. I'll post the link a little later, since I can't post more than 2 :/
(https://discuss.elastic.co/t/ngram-partial-match-limiting-ngram-results-in-multiple-field-query/62526)
Answer 0 (score: 1)
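I used your settings and mappings to create an index, and your query seems to work fine for me. I'd suggest running an `explain` on one of the "unexpected" documents being returned, to see why it matched and was returned alongside the other results.
Here's what I did:
Run the analyze API with your analyzer to see how the query gets split into tokens.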
curl -XGET 'localhost:9200/my_index/_analyze' -d '
{
"analyzer" : "ngram_analyzer",
"text" : "frodo"
}'
With your analyzer, frodo is split into 3 tokens.
{
"tokens": [
{
"token": "fro",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "rod",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "odo",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
}
]
}
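The token list above can be approximated outside Elasticsearch. This is a rough sketch, assuming the `standard` tokenizer can be stood in for by a split on non-word characters, which is a simplification. The function name `ngram_analyze` is made up for illustration.

```python
import re

def ngram_analyze(text, n=3):
    """Rough stand-in for `ngram_analyzer`: tokenize, lowercase, emit n-grams."""
    grams = []
    # Assumption: \w+ runs approximate the standard tokenizer's output.
    for token in re.findall(r"\w+", text):
        token = token.lower()  # the `lowercase` filter
        # Sliding n-character window (min_gram = max_gram = 3 in the settings).
        # In this sketch, tokens shorter than n characters emit nothing.
        grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams

print(ngram_analyze("frodo"))  # ['fro', 'rod', 'odo']
```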
I indexed 3 documents for testing (using only the ngram field). Here are the documents:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "2",
"_score": 1,
"_source": {
"ngram": "theodore"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"_score": 1,
"_source": {
"ngram": "frodo"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "3",
"_score": 1,
"_source": {
"ngram": "rudolpho"
}
}
]
}
}
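As for the first query you mentioned: it matches frodo and theodore, but not rudolpho as you reported. That makes sense, because rudolpho doesn't produce any trigrams that match frodo's trigrams: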
frodo -> fro, rod, odo
rudolpho -> rud, udo, dol, olp, lph, pho
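The same comparison can be reproduced in plain Python. This is only a sliding-window approximation of the trigram filter, not Elasticsearch itself, and the helper names are hypothetical. `rodolpho` is included because that is the spelling used in the question, whereas the test documents here use `rudolpho`.

```python
def trigrams(word):
    """All 3-character windows of a lowercased word, as a set."""
    word = word.lower()
    return {word[i:i + 3] for i in range(len(word) - 2)}

def shared_trigrams(query, name):
    """Sorted trigrams that the query and an indexed name have in common."""
    return sorted(trigrams(query) & trigrams(name))

for name in ["frodo", "theodore", "rudolpho", "rodolpho"]:
    print(name, shared_trigrams("frodo", name))
```

Under this approximation theodore shares only `odo` and rudolpho shares nothing, while rodolpho shares `odo` and `rod`.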
With your second query, I only get frodo back (neither of the other two).
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.53148466,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"_score": 0.53148466,
"_source": {
"ngram": "frodo"
}
}
]
}
}
Then I ran explain on the other two documents (theodore and rudolpho) using localhost:9200/my_index/my_type/2/_explain, and this is what I saw (I've trimmed the response):
{
"_index": "my_index",
"_type": "my_type",
"_id": "2",
"matched": false,
"explanation": {
"value": 0,
"description": "Failure to meet condition(s) of required/prohibited clause(s)",
"details": [
{
"value": 0,
"description": "no match on required clause ((ngram:fro ngram:rod ngram:odo)~2)",
"details": [
The above is expected, since at least two of the three tokens from frodo have to match.
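The `~2` in the explanation is consistent with how a percentage value of `minimum_should_match` is documented to resolve: the number computed from the percentage is rounded down. Here is a small sketch of that arithmetic for positive percentages only. The function name `required_matches` is hypothetical.

```python
def required_matches(num_clauses, percent):
    """Optional clauses that must match for a positive percentage (rounded down)."""
    return num_clauses * percent // 100

# frodo produces 3 trigram clauses: fro, rod, odo
print(required_matches(3, 90))   # 2, matching the ~2 in the explain output
print(required_matches(3, 100))  # 3
```

So at 90%, a name that shares any two of frodo's three trigrams still satisfies the `ngram` clause. Only 100% requires all three.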