How to filter top_hits results

Time: 2020-11-12 12:22:01

Tags: elasticsearch aggregation elasticsearch-aggregation

I am running some queries against the products index in Elasticsearch. The products index contains the following documents:
{ "product_name": "prod-1", "meta": [ { "key": "key1", "value": "value1" }, { "key": "key2", "value": "value2" } ] }
{ "product_name": "prod-2", "meta": [ { "key": "key1", "value": "value1" } ] }
{ "product_name": "prod-2", "meta": [ { "key": "key2", "value": "value2" } ] }
{ "product_name": "prod-3", "meta": [ { "key": "key2", "value": "value2" } ] }

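What I want is every product_name that has both key1/value1 and key2/value2 in its meta array, though not necessarily in the same document. For example, in the data above, prod-1 has both key1/value1 and key2/value2 in a single document, so I want prod-1 in the result. prod-2 also has both key1/value1 and key2/value2, but in different documents, and I want prod-2 in the result as well. prod-3 contains only key2/value2 across all of its documents, so I do not want prod-3 in the result.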
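To make the expected result concrete, here is a minimal Python sketch of the grouping semantics described above, run on the four sample documents. It does not use Elasticsearch; the function name products_with_all_pairs and the REQUIRED set are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# The four sample documents from the products index.
documents = [
    {"product_name": "prod-1", "meta": [{"key": "key1", "value": "value1"},
                                        {"key": "key2", "value": "value2"}]},
    {"product_name": "prod-2", "meta": [{"key": "key1", "value": "value1"}]},
    {"product_name": "prod-2", "meta": [{"key": "key2", "value": "value2"}]},
    {"product_name": "prod-3", "meta": [{"key": "key2", "value": "value2"}]},
]

# The key/value pairs every returned product must have (hypothetical name).
REQUIRED = {("key1", "value1"), ("key2", "value2")}


def products_with_all_pairs(docs, required):
    """Return product names whose documents, taken together, contain every required pair."""
    pairs_by_product = defaultdict(set)
    for doc in docs:
        for entry in doc["meta"]:
            # Collect pairs across all documents that share a product_name.
            pairs_by_product[doc["product_name"]].add((entry["key"], entry["value"]))
    # Keep a product only if the required pairs are a subset of what it has.
    return sorted(name for name, pairs in pairs_by_product.items() if required <= pairs)


matching = products_with_all_pairs(documents, REQUIRED)
print(matching)  # ['prod-1', 'prod-2']
```

prod-2 qualifies because its pairs are merged across its two documents before the check; prod-3 is dropped because it never has key1/value1.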

I am trying the following approach:

  1. Group by product name
  2. Then filter the aggregated result to check that each product has both key1/value1 and key2/value2

I group the documents by product_name and collect the meta fields in each bucket as follows:

{
  "size": 0,
  "aggs": {
    "by_product": {
      "terms": {
        "field": "product_name"
      },
      "aggs": {
        "all_meta": {
          "top_hits": {
            "_source": {
              "includes": [
                "meta.key",
                "meta.value"
              ]
            }
          }
        }
      }
    }
  }
}

The result of the aggregation above looks like this:

  "aggregations" : {
    "by_product" : {
      ...
      "buckets" : [
        {
          ...
          "key" : "prod-2",
          "all_meta" : {
            "hits" : {
              ...
              "hits" : [
                {
                  ....
                  "_source" : {
                    "meta" : [
                      {
                        "value" : "value1",
                        "key" : "key1"
                      }
                    ]
                  }
                },
                {
                  ....
                  "_source" : {
                    "meta" : [
                      {
                        "value" : "value2",
                        "key" : "key2"
                      }
                    ]
                  }
                }
              ]
            }
          }
        },
        {
          ....
          "key" : "prod-1",
          "all_meta" : {
            "hits" : {
              ....
              "hits" : [
                {
                  ....
                  "_source" : {
                    "meta" : [
                      {
                        "value" : "value1",
                        "key" : "key1"
                      },
                      {
                        "value" : "value2",
                        "key" : "key2"
                      }
                    ]
                  }
                }
              ]
            }
          }
        },
        {
          ....
          "key" : "prod-3",
          "all_meta" : {
            "hits" : {
              ....
              "hits" : [
                {
                  ....
                  "_source" : {
                    "meta" : [
                      {
                        "value" : "value2",
                        "key" : "key2"
                      }
                    ]
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }

Now I want to filter these buckets and keep only those whose hits contain both { "key": "key1", "value": "value1" } and { "key": "key2", "value": "value2" }. Something like this:

{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "buckets.all_meta.hits.hits._source.meta",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "buckets.all_meta.hits.hits._source.meta.key": "key1"
                    }
                  },
                  {
                    "match": {
                      "buckets.all_meta.hits.hits._source.meta.value": "value1"
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "nested": {
            "path": "buckets.all_meta.hits.hits._source.meta",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "buckets.all_meta.hits.hits._source.meta.key": "key2"
                    }
                  },
                  {
                    "match": {
                      "buckets.all_meta.hits.hits._source.meta.value": "value2"
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}

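But I am not sure how to perform the step above. Is it possible to do this? This Stack Overflow question is similar, but it has no answers. Is there another way to get the result I want? Any help would be appreciated. Thanks.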

1 Answer:

Answer 0: (score: 1)

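Here is a solution. The idea is that, within each product bucket, we aggregate all key/value pairs (using a scripted terms aggregation) and then, using a bucket_selector pipeline aggregation, select only the product buckets that have two distinct key/value pairs.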

POST products/_search
{
  "size": 0,
  "aggs": {
    "by_product": {
      "terms": {
        "field": "product_name.keyword"
      },
      "aggs": {
        "meta": {
          "nested": {
            "path": "meta"
          },
          "aggs": {
            "kv": {
              "terms": {
                "script": """
                [doc['meta.key.keyword'].value, doc['meta.value.keyword'].value].join('-')
                """,
                "size": 10
              }
            }
          }
        },
        "selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "meta>kv._bucket_count"
            },
            "script": "params.count == 2"
          }
        }
      }
    }
  }
}

In the results, you can see that only prod-1 and prod-2 remain:

  "buckets" : [
    {
      "key" : "prod-2",
      "doc_count" : 2,
      "meta" : {
        "doc_count" : 2,
        "kv" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "key1-value1",
              "doc_count" : 1
            },
            {
              "key" : "key2-value2",
              "doc_count" : 1
            }
          ]
        }
      }
    },
    {
      "key" : "prod-1",
      "doc_count" : 1,
      "meta" : {
        "doc_count" : 2,
        "kv" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "key1-value1",
              "doc_count" : 1
            },
            {
              "key" : "key2-value2",
              "doc_count" : 1
            }
          ]
        }
      }
    }
  ]
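Note that the selector above counts distinct key/value pairs, so it relies on the data containing no other pairs: a product with, say, key2/value2 and some third pair would also have a count of 2, and a product with both required pairs plus a third would be dropped. A stricter variant could count each required pair separately with a filter aggregation and keep only buckets where every count is positive. The Python sketch below only builds such a request body; it assumes, as the query above does, that meta is mapped as nested and that the .keyword sub-fields exist. The aggregation names p0, p1 and the helpers pair_filter and build_request are hypothetical, and the body has not been run against a cluster.

```python
import json


def pair_filter(key, value):
    """Filter aggregation matching nested meta entries with exactly this key and value."""
    return {
        "filter": {
            "bool": {
                "filter": [
                    {"term": {"meta.key.keyword": key}},
                    {"term": {"meta.value.keyword": value}},
                ]
            }
        }
    }


def build_request(required_pairs):
    """Build a search body that keeps only product buckets containing every required pair."""
    # One named filter aggregation per required pair: p0, p1, ...
    names = [f"p{i}" for i in range(len(required_pairs))]
    pair_aggs = {n: pair_filter(k, v) for n, (k, v) in zip(names, required_pairs)}
    return {
        "size": 0,
        "aggs": {
            "by_product": {
                "terms": {"field": "product_name.keyword"},
                "aggs": {
                    # Step into the nested meta entries and count each required pair.
                    "meta": {"nested": {"path": "meta"}, "aggs": pair_aggs},
                    # Keep the product bucket only if every pair count is positive.
                    "selector": {
                        "bucket_selector": {
                            "buckets_path": {n: f"meta>{n}._count" for n in names},
                            "script": " && ".join(f"params.{n} > 0" for n in names),
                        }
                    },
                },
            }
        },
    }


request_body = build_request([("key1", "value1"), ("key2", "value2")])
print(json.dumps(request_body, indent=2))
```

The printed JSON can be sent with POST products/_search in the same way as the query above. As with any terms aggregation, only the top product buckets are returned unless size is raised.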