Question

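I use buckets in an Elasticsearch query to reduce the number of documents returned. Each document has a timestamp and a value, and I query by time range. Documents are inserted every 1-5 seconds, and every timestamp is unique.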

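Because there can be billions of documents over time, I want to shrink the result set by averaging the values within each fixed time period of timeIntervalInSeconds seconds: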

    var searchRequest = new SearchRequest<DataPoint>(Indices.Parse(ElasticsearchConstants.GetIndexNameFromBatchId(batchId)))
    {
        Size = 0, // Only the aggregation data is needed; the hits of the top-level query are not.
        Query = new BoolQuery
        {
            Filter = new List<QueryContainer>
            {
                new DateRangeQuery
                {
                    Field = new Field("Timestamp"),
                    GreaterThanOrEqualTo = startDateTime.ToString("O", CultureInfo.InvariantCulture),
                    LessThanOrEqualTo = endDateTime.ToString("O", CultureInfo.InvariantCulture)
                }
            }
        },
        Aggregations = new DateHistogramAggregation(ElasticsearchConstants.DataPointsHistogramAggregationKeyString)
        {
            Field = "Timestamp",
            FixedInterval = new Union<DateInterval, Time>(new Time(timeIntervalInSeconds, TimeUnit.Second)),
            // startDateTime as seconds since the Unix epoch, rounded up.
            Offset = ((int)Math.Ceiling(startDateTime.Subtract(new DateTime(1970, 1, 1)).TotalSeconds)).ToString(),
            Order = HistogramOrder.KeyAscending,
            Aggregations = new ExtendedStatsAggregation("datapoints_date_histogram_stats", new Field("value")),
            MinimumDocumentCount = 1 // Only return buckets that contain documents.
        }
    };

The response from this search is then processed:

    var searchResponse = await elasticClient.SearchAsync<DataPoint>(searchRequest).ConfigureAwait(false);

    var dateHistogram = searchResponse.Aggregations.DateHistogram(ElasticsearchConstants.DataPointsHistogramAggregationKeyString);

    return (from item in dateHistogram.Buckets
            let extendedStatsAggregate = item.ExtendedStats("datapoints_date_histogram_stats")
            where extendedStatsAggregate.Count > 0
            let dataPointValue = extendedStatsAggregate.Average.Value
            select new DataPoint(
                item.Date,
                batchId,
                parameterId,
                dataPointValue,
                extendedStatsAggregate.Min ?? dataPointValue,
                extendedStatsAggregate.Max ?? dataPointValue,
                extendedStatsAggregate.StdDeviation ?? 0.0,
                extendedStatsAggregate.Count)).Cast<IDataPoint>()
        .ToList();

The higher I set timeIntervalInSeconds, the fewer buckets are created and the fewer data points are returned.
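To make the relationship between the interval and the result size concrete, here is a small sketch of the arithmetic. It assumes the buckets are aligned to the start of the queried range (which appears to be the intent of the Offset above) and gives only an upper bound, since MinimumDocumentCount = 1 drops empty buckets. The class and method names (BucketMath, MaxBucketCount) are hypothetical and not part of NEST.

```csharp
using System;

public static class BucketMath
{
    // Upper bound on the number of buckets for the inclusive range [start, end],
    // assuming buckets are aligned to start and each is intervalSeconds wide.
    // Hypothetical helper for illustration only.
    public static long MaxBucketCount(DateTime start, DateTime end, int intervalSeconds)
    {
        if (intervalSeconds <= 0)
            throw new ArgumentOutOfRangeException(nameof(intervalSeconds));
        if (end < start)
            throw new ArgumentException("end must not precede start");

        // Number of whole intervals that fit into the range, plus the bucket
        // that holds the end of the range itself.
        long wholeIntervals = (long)Math.Floor((end - start).TotalSeconds / intervalSeconds);
        return wholeIntervals + 1;
    }

    public static void Main()
    {
        var start = new DateTime(2020, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        var end = start.AddHours(1);

        Console.WriteLine(MaxBucketCount(start, end, 10));   // 361
        Console.WriteLine(MaxBucketCount(start, end, 60));   // 61
        Console.WriteLine(MaxBucketCount(start, end, 600));  // 7
    }
}
```

With documents arriving every 1-5 seconds, a one-hour range holds roughly 720 to 3600 documents, so a 60-second interval would cut the result to at most 61 data points.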

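There is just one drawback: the timestamp of the last data point returned is always earlier than the timestamp of the actual last document in the given time range. That follows from how the query is designed, of course, but is there a way to work around it? Ideally I would end up with a reduced set of data points in which the first and last timestamps match the actual document timestamps and everything in between is "averaged" in some way.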
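The effect described above can be reproduced without Elasticsearch. The following is a minimal client-side sketch, not the asker's code and not a NEST API: it groups samples into fixed-width buckets aligned to the range start, the way a date histogram reports each bucket under its key (the bucket's start), and additionally tracks the earliest and latest real timestamp per bucket so that the first and last reduced points can be pinned to actual document timestamps. The names Downsampler and Reduce are hypothetical. On the server side, one possible equivalent (an untested assumption here) would be adding `min` and `max` sub-aggregations on Timestamp next to the extended stats and using those values for the first and last bucket.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Downsampler
{
    // Averages samples per fixed-width bucket aligned to `start`.
    // Middle points carry the bucket key (bucket start); the first and last
    // points carry the real earliest/latest timestamp found in their bucket.
    // If there is only one bucket, it gets the latest timestamp.
    public static List<(DateTime Timestamp, double Average)> Reduce(
        IEnumerable<(DateTime Timestamp, double Value)> samples,
        DateTime start,
        int intervalSeconds)
    {
        var buckets = samples
            .GroupBy(s => (long)Math.Floor((s.Timestamp - start).TotalSeconds / intervalSeconds))
            .OrderBy(g => g.Key)
            .Select(g => new
            {
                Key = start.AddSeconds(g.Key * intervalSeconds), // what a histogram bucket reports
                First = g.Min(s => s.Timestamp),                 // earliest real timestamp in the bucket
                Last = g.Max(s => s.Timestamp),                  // latest real timestamp in the bucket
                Average = g.Average(s => s.Value)
            })
            .ToList();

        var result = new List<(DateTime Timestamp, double Average)>();
        for (int i = 0; i < buckets.Count; i++)
        {
            var b = buckets[i];
            DateTime ts = i == buckets.Count - 1 ? b.Last
                        : i == 0 ? b.First
                        : b.Key;
            result.Add((ts, b.Average));
        }
        return result;
    }

    public static void Main()
    {
        var start = new DateTime(2020, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        var samples = new[]
        {
            (start.AddSeconds(2), 1.0),
            (start.AddSeconds(5), 3.0),
            (start.AddSeconds(12), 5.0),
            (start.AddSeconds(27), 7.0)
        };

        // With 10-second buckets the keys would be +0 s, +10 s and +20 s,
        // while the reduced points below sit at +2 s, +10 s and +27 s.
        foreach (var point in Reduce(samples, start, 10))
        {
            Console.WriteLine($"{(point.Timestamp - start).TotalSeconds} {point.Average}");
        }
    }
}
```

In this sample the last bucket's key is 20 seconds after the start while the last real document is at 27 seconds, which is the gap the question describes. Pinning only the two end points keeps the middle of the series evenly spaced.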

Reducing the documents in a query using buckets

0 answers: