Question

For each search request I have a list of allowed tags. For example:

["search", "open_source", "freeware", "linux"]

I want to retrieve the documents whose tags are all in this list. I want to retrieve:

{
    "tags": ["search", "freeware"]
}

and exclude

{
    "tags": ["search", "windows"]
}

because the list does not contain the windows tag.
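To pin down the semantics being asked for, here is a minimal client-side sketch: a document qualifies when its tag set is a subset of the allowed list. The function name `only_allowed_tags` is just an illustration, not anything from Elasticsearch.

```python
def only_allowed_tags(doc_tags, allowed_tags):
    """Return True when every tag on the document is in the allowed list."""
    return set(doc_tags) <= set(allowed_tags)


allowed = ["search", "open_source", "freeware", "linux"]

# The first document from the question qualifies, the second does not.
wanted = only_allowed_tags(["search", "freeware"], allowed)
unwanted = only_allowed_tags(["search", "windows"], allowed)
print(wanted, unwanted)
```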

The Elasticsearch documentation has an example for exact equality:

https://www.elastic.co/guide/en/elasticsearch/guide/current/_finding_multiple_exact_values.html

First, we include a field that maintains the number of tags:

{ "tags" : ["search"], "tag_count" : 1 }
{ "tags" : ["search", "open_source"], "tag_count" : 2 }
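Keeping `tag_count` in sync is easiest if it is computed right before indexing. A minimal sketch, assuming documents are plain dicts and that the count should reflect distinct tags (the helper name `with_tag_count` is made up for this example):

```python
def with_tag_count(doc):
    """Return a copy of the document with tag_count set to the number of distinct tags."""
    tags = doc.get("tags", [])
    return {**doc, "tag_count": len(set(tags))}


docs = [
    with_tag_count({"tags": ["search"]}),
    with_tag_count({"tags": ["search", "open_source"]}),
]
print(docs)
```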

Second, we search with the required tag_count:

GET /my_index/my_type/_search
{
    "query": {
        "filtered" : {
            "filter" : {
                 "bool" : {
                    "must" : [
                        { "term" : { "tags" : "search" } }, 
                        { "term" : { "tags" : "open_source" } }, 
                        { "term" : { "tag_count" : 2 } } 
                    ]
                }
            }
        }
    }
}

The problem is that I don't know tag_count.

I also tried writing a query with a script_field tags_count, listing each allowed tag in a terms query and setting minimum_should_match to tags_count, but I can't use a script variable in minimum_should_match.

What can I look into?
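The idea of using a per-document count as minimum_should_match is, as far as I know, directly supported in newer Elasticsearch versions (6.1 and later) by the `terms_set` query and its `minimum_should_match_field` parameter. This postdates the `filtered` query shown above, so treat the following as a sketch under that version assumption; it only builds the request body and also assumes `tag_count` is mapped as a numeric field holding the number of distinct tags.

```python
import json


def terms_set_query(allowed_tags, count_field="tag_count"):
    """Build a terms_set query body: a document matches when at least
    <count_field> of its tags appear among the allowed tags."""
    return {
        "query": {
            "terms_set": {
                "tags": {
                    "terms": list(allowed_tags),
                    "minimum_should_match_field": count_field,
                }
            }
        }
    }


body = terms_set_query(["search", "open_source", "freeware", "linux"])
print(json.dumps(body, indent=2))
```

Since a document with N distinct tags needs N of them to match, only documents whose tags are all allowed get through.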

Answer 1

I admit this isn't a great solution, but maybe it will inspire other, better ones?

Given that the records you search carry a tag_count field as in your post:

"tags" : ["search"],
"tag_count" : 1

or

"tags" : ["search", "open_source"],
"tag_count" : 2

and you have a query like:

["search", "open_source", "freeware"]

then you can programmatically generate a query like this:

{
    "query" : {
        "bool" : {
            "should" : [
                {
                    "bool" : {
                        "should" : [
                            { "term" : { "tags" : "search" } },
                            { "term" : { "tags" : "open_source" } },
                            { "term" : { "tags" : "freeware" } },
                            { "term" : { "tag_count" : 1 } }
                        ],
                        "minimum_should_match" : 2
                    }
                },
                {
                    "bool" : {
                        "should" : [
                            { "term" : { "tags" : "search" } },
                            { "term" : { "tags" : "open_source" } },
                            { "term" : { "tags" : "freeware" } },
                            { "term" : { "tag_count" : 2 } }
                        ],
                        "minimum_should_match" : 3
                    }
                },
                {
                    "bool" : {
                        "should" : [
                            { "term" : { "tags" : "search" } },
                            { "term" : { "tags" : "open_source" } },
                            { "term" : { "tags" : "freeware" } },
                            { "term" : { "tag_count" : 3 } }
                        ],
                        "minimum_should_match" : 4
                    }
                }
            ],
            "minimum_should_match" : 1
        }
    }
}

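The number of nested bool queries equals the number of tags in the query (not great for several reasons, but with smaller queries or smaller indexes you might get away with it). Basically, each clause handles one possible value of tag_count, and minimum_should_match is tag_count + 1 (so it matches the tag_count term plus tag_count matching tags).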
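Generating one clause per possible tag_count could look roughly like the sketch below. Note that it deviates from the query above in one respect, and that is my own choice rather than the answer's: the tag_count term goes into `filter` instead of sitting among the `should` clauses, and minimum_should_match is then just the count. With the count as one more `should` clause, a document carrying extra tags (say `["search", "open_source", "windows"]`) could reach minimum_should_match 2 through tag terms alone. The function name is hypothetical.

```python
import json


def subset_query(query_tags):
    """Build one nested bool clause per possible tag_count (1..len(tags))."""
    tags = list(dict.fromkeys(query_tags))  # de-duplicate, keep order
    tag_terms = [{"term": {"tags": tag}} for tag in tags]
    clauses = []
    for count in range(1, len(tags) + 1):
        clauses.append({
            "bool": {
                "should": tag_terms,
                # The count is mandatory, so extra tags cannot slip through.
                "filter": {"term": {"tag_count": count}},
                "minimum_should_match": count,
            }
        })
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


generated = subset_query(["search", "open_source", "freeware"])
print(json.dumps(generated, indent=2))
```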

Answer 2

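If the index is of moderate size and the tag cardinality is fairly low, I would just use a terms aggregation to get the distinct tags, and build must and must_not filters to filter out documents containing tags you did not "allow". There are many ways to cache the list of all tags in an in-memory database such as Redis; here are a few that come to mind: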

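Give it a time to live of a few minutes or hours, and regenerate the list when the cache has expired
Have a background process refresh the list periodically
Update the list when new documents are inserted; document deletions should then be handled as well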
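Whichever caching strategy is used, turning the cached tag list into the must / must_not filter might look like this sketch. Here `known_tags` stands in for whatever the cache returns, and the function name is made up for illustration.

```python
def allowed_only_filter(known_tags, allowed_tags):
    """Build a bool filter: at least one allowed tag, none of the other known tags."""
    allowed = sorted(set(allowed_tags))
    disallowed = sorted(set(known_tags) - set(allowed))
    bool_filter = {"must": [{"terms": {"tags": allowed}}]}
    if disallowed:  # skip the clause when there is nothing to exclude
        bool_filter["must_not"] = [{"terms": {"tags": disallowed}}]
    return {"bool": bool_filter}


known = {"search", "open_source", "freeware", "linux", "windows", "macos"}
tag_filter = allowed_only_filter(known, ["search", "open_source", "freeware", "linux"])
print(tag_filter)
```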

A more performant and 100% accurate approach could look like this:

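Query for all documents that have the requested tags, but exclude documents that contain other known tags (as in the first solution)
Go through the list of returned documents
If a document contains a tag that is not "allowed", that tag was not in the known-tags cache and must be added to it; exclude this document from the result set
Tags in Redis can have a TTL of, say, a day or a week, so that old tags are pruned automatically and the ES queries stay simpler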
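The client-side pruning step from the list above can be sketched as follows. A plain Python set stands in for the Redis cache here (so no TTL handling), and the hits are assumed to have the usual `_source` layout of a search response; the function name is hypothetical.

```python
def prune_hits(hits, allowed_tags, known_tags):
    """Drop hits carrying non-allowed tags and remember those tags for later queries."""
    allowed = set(allowed_tags)
    kept = []
    for hit in hits:
        extra = set(hit["_source"]["tags"]) - allowed
        if extra:
            # Tag was missing from the cache; record it so the next query excludes it.
            known_tags.update(extra)
        else:
            kept.append(hit)
    return kept


cache = set()  # stand-in for the Redis tag cache
hits = [
    {"_source": {"tags": ["search", "freeware"]}},
    {"_source": {"tags": ["search", "bsd"]}},
]
kept = prune_hits(hits, ["search", "open_source", "freeware", "linux"], cache)
print(kept, cache)
```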

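This way you don't need a background process to maintain the tag list, nor a potentially heavy terms aggregation that visits every document, and you always get the correct result set with a fairly efficient query.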

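This does not work if you run aggregations on the results, because ES may return spurious documents that are only pruned on the client side. However, this can be detected by adding a terms aggregation and checking that it contains no unexpected tags. If it does, they need to be added to the tag cache and to the must_not filter, and the query must be re-executed. This is not ideal if new tags are created frequently.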

Answer 3

Why not use bool and add windows to the must_not clause? I hope this is what you are looking for.

Answer 4

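@Sergey Shuvalov, another way to handle this without using scripts is to create another field whose value contains all the tags, sorted and separated by commas (for example; you can pick any separator that suits you).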

For example, if you have a document like this:

{
    "tags": ["search", "open_source", "freeware", "linux"]
}

you would create another field, alltags, containing the same tags but sorted lexicographically and separated by commas, like this:

{
  "tags": ["search", "open_source", "freeware", "linux"],
  "alltags": "freeware,linux,open_source,search"
}

The new alltags field is not_analyzed, so it has the following mapping:

{
  "mappings": {
    "doc": {
      "properties": {
        "tags": {
          "type": "string"
        },
        "alltags": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

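You can then issue a simple term query like the one below; you just need to make sure the tags are sorted as well, and you will get the matching documents.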

{
  "query": {
    "term": {
      "alltags": "freeware,linux,open_source,search"
    }
  }
}

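If you have very long tag lists, you may instead decide to generate an MD5 or SHA1 hash of the sorted tag list, store only that value in the alltags field, and use the same value when searching. The bottom line is that you need some kind of "signature" for your tag list, and to know that the signature is always the same for the same set of tags. The sky's the limit!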
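A signature of this kind could be computed as in the sketch below, used both at index time and at query time. The function name and the `hashed` switch are my own; the comma separator and SHA1 follow the description above.

```python
import hashlib


def tag_signature(tags, hashed=False):
    """Return an order-independent signature for a set of tags."""
    joined = ",".join(sorted(set(tags)))
    if hashed:
        # Fixed-length value for very long tag lists.
        return hashlib.sha1(joined.encode("utf-8")).hexdigest()
    return joined


signature = tag_signature(["search", "open_source", "freeware", "linux"])
print(signature)
print(tag_signature(["linux", "search", "freeware", "open_source"], hashed=True))
```

Because the tags are sorted before joining, any ordering of the same tag set yields the same value.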

Answer 5

As I mentioned earlier, I combined two good answers. Here is what I have:

"query" : {
    "bool":{
        "should":[
            {"term":{"tag_count":1}},
            {
                "bool":{
                    "should":[
                        {"term":{"tags":"search"}},
                        {"term":{"tags":"open_source"}},
                        {"term":{"tags":"freeware"}}
                    ],
                    "filter":{"term":{"tag_count":2}},
                    "minimum_should_match":2
                }
            },
            {
                "bool":{
                    "should":[
                        {"term":{"tags":"search"}},
                        {"term":{"tags":"open_source"}},
                        {"term":{"tags":"freeware"}}
                    ],
                    "filter":{"term":{"tag_count":3}},
                    "minimum_should_match":3
                }
            },
            {
                "script": {
                    "script": "tags.containsAll(doc['tags'].values)",
                    "params": {"tags":["search", "open_source", "freeware"]}
                }
            }
        ],
        "filter":{ "terms" : {"tags" :["search", "open_source", "freeware"]}},
        "minimum_should_match": 1
    }
}

The script clause is there for the hard cases; the other clauses handle the simple ones.

Retrieve documents that contain only allowed tags (exact equality)
