Count every term in a query

Time: 2018-08-10 14:45:20

Tags: elasticsearch

Is it possible to get a count for each term in a query?

For example, I have a few conditions I want to count:

(age == 20 || age == 30) && gender == 'male'

I want a single REST call to return the total count plus the sub-count for every condition.

Expected counts:

  1. age == 20
  2. age == 30
  3. age == 20 || age == 30
  4. gender == 'male'
  5. (age == 20 || age == 30) && gender == 'male'

A sample search query built for this specific case:

{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "age": { "value": 20,"boost": 1 } // count 1
                }
              },
              {
                "term": {
                  "age": { "value": 30,"boost": 1 } // count 2
                }
              }
            ],
            "adjust_pure_negative": true, "boost": 1
          } // count 3
        },
        {
          "term": {
            "gender.keyword": { "value": "male", "boost": 1 } // count 4
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    } // count 5
  }
}

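1 Answer:

Answer 0 (score: 1)

Updated to count arbitrary conditions

Based on your comment, if your goal is to count arbitrary conditions over the result set, you can use the Filters Aggregation. It lets you define each bucket of the aggregation result with a query and returns the document count for each bucket. This requires you to write a query for every combination you want to capture. If you need every possible combination, it is probably better to return the individual bucket counts and do the math yourself, as in the original solution below. For your case, it looks like this: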

{
  "aggs": {
    "conditions": {
      "filters": {
        "filters": {
          "age == 20": {"term": {"age": 20}},
          "age == 30": {"term": {"age": 30}},
          "age == 20 || age == 30": {
            "bool": {
              "should": [
                {"term": {"age": 20}},
                {"term": {"age": 30}}
              ]
            }
          },
          "gender == male": {"term": {"gender.keyword": "male"}},
          "(age == 20 || age == 30) && gender == 'male'": {
            "bool": {
              "must": [
                {"term": {"gender.keyword": "male"}}
              ],
              "should": [
                {"term": {"age": 20}},
                {"term": {"age": 30}}
              ],
              "minimum_should_match": 1
            }
          }
        }
      }
    }
  }
}

Which gives you this result:

{
  "aggregations": {
    "conditions": {
      "buckets": {
        "(age == 20 || age == 30) && gender == 'male'": {
          "doc_count": 12
        },
        "age == 20": {
          "doc_count": 8
        },
        "age == 20 || age == 30": {
          "doc_count": 19
        },
        "age == 30": {
          "doc_count": 11
        },
        "gender == male": {
          "doc_count": 12
        }
      }
    }
  }
}
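
If the list of conditions changes often, the request body can be generated rather than written by hand. The following is a minimal Python sketch and not part of the original answer: `build_filters_agg` and `extract_counts` are hypothetical helper names, `"size": 0` reflects my assumption that only the counts are needed and not the hits, and the response is assumed to have the shape shown above. It only builds and parses dictionaries; sending the request is left to whatever client you use.

```python
import json


def build_filters_agg(conditions, agg_name="conditions"):
    """Wrap named query clauses in one filters aggregation request body.

    `conditions` maps a label to an Elasticsearch query clause (a dict).
    """
    return {
        "size": 0,  # assumption: only the counts are wanted, not the hits
        "aggs": {agg_name: {"filters": {"filters": dict(conditions)}}},
    }


def extract_counts(response, agg_name="conditions"):
    """Return {label: doc_count} from a filters aggregation response."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {label: bucket["doc_count"] for label, bucket in buckets.items()}


age_20 = {"term": {"age": 20}}
age_30 = {"term": {"age": 30}}
male = {"term": {"gender.keyword": "male"}}
age_20_or_30 = {"bool": {"should": [age_20, age_30], "minimum_should_match": 1}}

conditions = {
    "age == 20": age_20,
    "age == 30": age_30,
    "age == 20 || age == 30": age_20_or_30,
    "gender == male": male,
    # Nesting the OR inside `must` keeps the OR mandatory
    "(age == 20 || age == 30) && gender == 'male'": {
        "bool": {"must": [age_20_or_30, male]}
    },
}

body = build_filters_agg(conditions)
print(json.dumps(body, indent=2))

# Counts copied from the sample result above
sample_response = {
    "aggregations": {
        "conditions": {
            "buckets": {
                "(age == 20 || age == 30) && gender == 'male'": {"doc_count": 12},
                "age == 20": {"doc_count": 8},
                "age == 20 || age == 30": {"doc_count": 19},
                "age == 30": {"doc_count": 11},
                "gender == male": {"doc_count": 12},
            }
        }
    }
}
counts = extract_counts(sample_response)
print(counts)
```

Reusing the small clause dictionaries (`age_20`, `male`, and so on) keeps each compound condition in step with its parts.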

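Edit: the original answer, below, does not handle (A || B) correctly

The feature you are looking for is called aggregations, specifically the Terms Aggregation. A terms aggregation counts, for each distinct value of a field, the documents in the result set (the documents matching your query clause) that have that value. You can also nest aggregations. So in the example below, Elasticsearch finds all documents matching your query, then counts the documents for each age (20, 30, and so on), and then, within each age, counts the documents for each gender. You can then do the math to compute the different combinations you need.

Your query would look like this: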

{
  "query": {
    ...
  },
  "aggs": {
    "age": {
      "terms": {"field": "age"},
      "aggs": {
        "gender": {
          "terms": {"field": "gender.keyword"}
        }
      }
    },
    "gender_total": {"terms": {"field": "gender.keyword"}}
  }
}

The result looks something like this:

{
  "hits": { ... },
  "aggregations": {
    "gender_total": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "male",
          "doc_count": 12
        },
        {
          "key": "female",
          "doc_count": 7
        }
      ]
    },
    "age": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 30,
          "doc_count": 11,
          "gender": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "male",
                "doc_count": 9
              },
              {
                "key": "female",
                "doc_count": 2
              }
            ]
          }
        },
        {
          "key": 20,
          "doc_count": 8,
          "gender": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "female",
                "doc_count": 5
              },
              {
                "key": "male",
                "doc_count": 3
              }
            ]
          }
        }
      ]
    }
  }
}
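
Before doing the arithmetic, it can help to flatten the nested buckets into a lookup table keyed by (age, gender). This is a minimal Python sketch, assuming the response has the shape shown above; `flatten_age_gender` is a hypothetical helper name, and the literal copies only the `age` aggregation from that sample result.

```python
def flatten_age_gender(response):
    """Map (age, gender) -> doc_count from the nested terms aggregation."""
    cells = {}
    for age_bucket in response["aggregations"]["age"]["buckets"]:
        for gender_bucket in age_bucket["gender"]["buckets"]:
            key = (age_bucket["key"], gender_bucket["key"])
            cells[key] = gender_bucket["doc_count"]
    return cells


sample_response = {
    "aggregations": {
        "age": {
            "buckets": [
                {"key": 30, "doc_count": 11, "gender": {"buckets": [
                    {"key": "male", "doc_count": 9},
                    {"key": "female", "doc_count": 2},
                ]}},
                {"key": 20, "doc_count": 8, "gender": {"buckets": [
                    {"key": "female", "doc_count": 5},
                    {"key": "male", "doc_count": 3},
                ]}},
            ]
        }
    }
}

cells = flatten_age_gender(sample_response)
print(cells)

# Cross-check: summing the female cells reproduces gender_total's female bucket
female_total = sum(n for (age, gender), n in cells.items() if gender == "female")
print(female_total)  # 7 for the sample data (2 + 5)
```

With the table in hand, any combination of age and gender is a sum over the matching cells.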

For example, to compute the count for (age == 20 || age == 30) && gender == 'male', you could do something like the following Python pseudo-code:

# `result` is the parsed JSON response of the search request above (a dict)
# Pull out the bucket objects for each aggregation
age_buckets = result['aggregations']['age']['buckets']
gender_buckets = result['aggregations']['gender_total']['buckets']

# Get the bucket values we care about
age_20 = [b for b in age_buckets if b['key'] == 20][0]
age_30 = [b for b in age_buckets if b['key'] == 30][0]
male = [b for b in gender_buckets if b['key'] == 'male'][0]

# Get the sub-buckets
age_20_male = [b for b in age_20['gender']['buckets'] if b['key'] == 'male'][0]
age_30_male = [b for b in age_30['gender']['buckets'] if b['key'] == 'male'][0]

# age == 20
count_1 = age_20['doc_count']

# age == 30
count_2 = age_30['doc_count']

# age == 20 || age == 30
count_3 = count_1 + count_2

# gender == 'male'
count_4 = male['doc_count']

# (age == 20 || age == 30) && gender == 'male'
count_5 = age_20_male['doc_count'] + age_30_male['doc_count']
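
Adding the counts for age == 20 and age == 30 is only valid because a document with a single `age` value cannot fall into both buckets. For conditions that can overlap, the intersection has to be subtracted once (inclusion-exclusion). This is a minimal Python sketch; `union_count` is a hypothetical helper, and the numbers come from the sample result above, assuming `age` is single-valued.

```python
def union_count(count_a, count_b, count_a_and_b):
    """Documents matching A || B, given the counts for A, B, and A && B."""
    return count_a + count_b - count_a_and_b


# age == 20 || age == 30: the two buckets are disjoint, so the overlap is 0
print(union_count(8, 11, 0))  # 19

# age == 20 || gender == 'male': 3 documents are both 20 and male
print(union_count(8, 12, 3))  # 17
```

The second call shows the case where a plain sum would overcount: the 3 documents that are both 20 and male would otherwise be counted twice.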