Get the total count of all documents in a bucket

Time: 2017-12-08 12:14:50

Tags: elasticsearch

When I search with the following aggregation:

"aggregations": {
"codes": {
  "terms": {
    "field": "code"
  },
  "aggs": {
    "dates": {
      "date_range": {
        "field": "created_time",
        "ranges": [
          {
            "from": "2017-12-06T00:00:00.000",
            "to": "2017-12-06T16:00:00.000"
          },
          {
            "from": "2017-12-07T00:00:00.000",
            "to": "2017-12-07T23:59:59.999"
          }
        ]
      }
    }
  }
}
}

I get the following result:

"aggregations": {
"codes": {
  "buckets": [
    {
      "key": "123456",
      "doc_count": 104005499,
      "dates": {
        "buckets": [
          {
            "key": "2017-12-05T20:00:00.000Z-2017-12-06T12:00:00.000Z",
            "from_as_string": "2017-12-05T20:00:00.000Z",
            "to_as_string": "2017-12-06T12:00:00.000Z",
            "doc_count": 156643
          },
          {
            "key": "2017-12-06T20:00:00.000Z-2017-12-07T19:59:59.999Z",
            "from_as_string": "2017-12-06T20:00:00.000Z",
            "to_as_string": "2017-12-07T19:59:59.999Z",
            "doc_count": 11874
          }
        ]
      }
    },
    ...
  ]
 }
}

So now I have a list of buckets. I need a total count for each bucket, which is the sum of the doc_counts of its inner buckets. For example, the total for my first bucket should be 156643 + 11874 = 168517. I have tried the Sum Bucket aggregation:

 "totalcount": {
      "sum_bucket": {
        "buckets_path": "dates"
      }
    }

but this doesn't work: "buckets_path must reference either a number value or a single value numeric metric aggregation, got: org.elasticsearch.search.aggregations.bucket.range.date.InternalDateRange.Bucket". Any ideas how I can do this?

1 Answer:

Answer 0 (score: 0)

It looks like this is a known issue. There is a discussion on the Elastic forum where I found a hack that solves it (thanks to Ruslan_Didyk, its author, by the way):

POST my_aggs/my_doc/_search
{
  "size": 0,
  "aggregations": {
    "codes": {
      "terms": {
        "field": "code"
      },
      "aggs": {
        "dates": {
          "date_range": {
            "field": "created_time",
            "ranges": [
              {
                "from": "2017-12-06T00:00:00.000",
                "to": "2017-12-06T16:00:00.000"
              },
              {
                "from": "2017-12-07T00:00:00.000",
                "to": "2017-12-07T23:59:59.999"
              }
            ]
          },
          "aggs": {
            "my_cnt": {
              "value_count": {
                "field": "created_time"
              }
            }
          }
        },
        "totalcount": {
          "stats_bucket": {
            "buckets_path": "dates>my_cnt"
          }
        }
      }
    }
  }
}

The reason you can't just compute totalcount directly is that date_range implicitly creates sub-buckets, and the pipeline aggregation cannot handle that (I would call this a bug in Elasticsearch).

So the hack is to add another sub-aggregation to dates, my_cnt, which simply counts the number of documents in each bucket. (Note that I used a value_count aggregation on the created_time field, on the assumption that it is present in all documents and has exactly one value.)

Given a set of documents like this:

{"code":"1234","created_time":"2017-12-06T01:00:00"}
{"code":"1234","created_time":"2017-12-06T17:00:00"}
{"code":"1234","created_time":"2017-12-07T01:00:00"}
{"code":"1234","created_time":"2017-12-06T02:00:00"}
{"code":"1235","created_time":"2017-12-07T18:00:00"}
{"code":"1234","created_time":"2017-12-07T18:00:00"}

The result of the aggregation will be:

  "aggregations": {
    "codes": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "1234",
          "doc_count": 5,
          "dates": {
            "buckets": [
              {
                "key": "2017-12-06T00:00:00.000Z-2017-12-06T16:00:00.000Z",
                "from": 1512518400000,
                "from_as_string": "2017-12-06T00:00:00.000Z",
                "to": 1512576000000,
                "to_as_string": "2017-12-06T16:00:00.000Z",
                "doc_count": 2,
                "my_cnt": {
                  "value": 2
                }
              },
              {
                "key": "2017-12-07T00:00:00.000Z-2017-12-07T23:59:59.999Z",
                "from": 1512604800000,
                "from_as_string": "2017-12-07T00:00:00.000Z",
                "to": 1512691199999,
                "to_as_string": "2017-12-07T23:59:59.999Z",
                "doc_count": 2,
                "my_cnt": {
                  "value": 2
                }
              }
            ]
          },
          "totalcount": {
            "count": 2,
            "min": 2,
            "max": 2,
            "avg": 2,
            "sum": 4
          }
        },
        {
          "key": "1235",
          "doc_count": 1,
          "dates": {
            "buckets": [
              {
                "key": "2017-12-06T00:00:00.000Z-2017-12-06T16:00:00.000Z",
                "from": 1512518400000,
                "from_as_string": "2017-12-06T00:00:00.000Z",
                "to": 1512576000000,
                "to_as_string": "2017-12-06T16:00:00.000Z",
                "doc_count": 0,
                "my_cnt": {
                  "value": 0
                }
              },
              {
                "key": "2017-12-07T00:00:00.000Z-2017-12-07T23:59:59.999Z",
                "from": 1512604800000,
                "from_as_string": "2017-12-07T00:00:00.000Z",
                "to": 1512691199999,
                "to_as_string": "2017-12-07T23:59:59.999Z",
                "doc_count": 1,
                "my_cnt": {
                  "value": 1
                }
              }
            ]
          },
          "totalcount": {
            "count": 1,
            "min": 1,
            "max": 1,
            "avg": 1,
            "sum": 1
          }
        }
      ]
    }
  }

The value you want is in totalcount.sum.
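
As a side note, if you only need the sum (rather than the full stats), the sum_bucket pipeline aggregation you originally tried should also work once its buckets_path points at the single-value my_cnt metric instead of the dates bucket itself — a sketch, not verified here:

"totalcount": {
  "sum_bucket": {
    "buckets_path": "dates>my_cnt"
  }
}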

A few caveats:

As I already said, this only works as long as the assumption that created_time is always present and has exactly one value holds. If, in a different scenario, the field under the date_range aggregation had multiple values (e.g. an update_time recording all updates to a document), the sum would no longer equal the actual number of matching documents (if those dates overlap).

In that case you can always use a filter aggregation with a range query instead.
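
A minimal sketch of that alternative, reusing the field names and ranges from above (the in_ranges aggregation name is made up for illustration): because a filter aggregation counts documents rather than field values, each document contributes at most once to doc_count even if several of its dates fall into the ranges.

POST my_aggs/my_doc/_search
{
  "size": 0,
  "aggregations": {
    "codes": {
      "terms": {
        "field": "code"
      },
      "aggs": {
        "in_ranges": {
          "filter": {
            "bool": {
              "should": [
                { "range": { "created_time": { "gte": "2017-12-06T00:00:00.000", "lt": "2017-12-06T16:00:00.000" } } },
                { "range": { "created_time": { "gte": "2017-12-07T00:00:00.000", "lt": "2017-12-07T23:59:59.999" } } }
              ]
            }
          }
        }
      }
    }
  }
}

The per-code total is then simply in_ranges.doc_count in each terms bucket, with no pipeline aggregation needed.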

Hope that helps!