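I use buckets in my ES query to reduce the number of documents returned.
The documents have a timestamp and a value, and I query for a time range.
Documents are inserted every 1-5 seconds, and each timestamp is unique.
Since there may be billions of documents over time, I want to reduce the data set by averaging the values over a given period (timeIntervalInSeconds):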
    var searchRequest = new SearchRequest<DataPoint>(Indices.Parse(ElasticsearchConstants.GetIndexNameFromBatchId(batchId)))
    {
        Size = 0, // Only the aggregation data is needed; the hits of the top-level query are not required.
        Query = new BoolQuery
        {
            Filter = new List<QueryContainer>
            {
                new DateRangeQuery
                {
                    Field = new Field("Timestamp"),
                    GreaterThanOrEqualTo = startDateTime.ToString("O", CultureInfo.InvariantCulture),
                    LessThanOrEqualTo = endDateTime.ToString("O", CultureInfo.InvariantCulture)
                }
            }
        },
        Aggregations = new DateHistogramAggregation(ElasticsearchConstants.DataPointsHistogramAggregationKeyString)
        {
            Field = "Timestamp",
            FixedInterval = new Union<DateInterval, Time>(new Time(timeIntervalInSeconds, TimeUnit.Second)),
            Offset = ((int)Math.Ceiling(startDateTime.Subtract(new DateTime(1970, 1, 1)).TotalSeconds)).ToString(),
            Order = HistogramOrder.KeyAscending,
            Aggregations = new ExtendedStatsAggregation("datapoints_date_histogram_stats", new Field("value")),
            MinimumDocumentCount = 1 // Only return buckets that contain documents.
        }
    };
I then process the response from this search:
    var searchResponse = await elasticClient.SearchAsync<DataPoint>(searchRequest).ConfigureAwait(false);
    var dateHistogram = searchResponse.Aggregations.DateHistogram(ElasticsearchConstants.DataPointsHistogramAggregationKeyString);
    return (from item in dateHistogram.Buckets
            let extendedStatsAggregate = item.ExtendedStats("datapoints_date_histogram_stats")
            where extendedStatsAggregate.Count > 0
            let dataPointValue = extendedStatsAggregate.Average.Value
            select new DataPoint(
                item.Date,
                batchId,
                parameterId,
                dataPointValue,
                extendedStatsAggregate.Min ?? dataPointValue,
                extendedStatsAggregate.Max ?? dataPointValue,
                extendedStatsAggregate.StdDeviation ?? 0.0,
                extendedStatsAggregate.Count)).Cast<IDataPoint>()
        .ToList();
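The higher I choose timeIntervalInSeconds, the fewer buckets are created and the fewer documents are returned.
There is only one drawback: the timestamp of the last returned document is always < the timestamp of the actual last document in the given time range. This is of course by design of the query, but is there a way to work around it? Ideally, I would like to end up with a reduced set of documents in which the first/last timestamps match the actual document timestamps and everything in between is "averaged" in some way.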
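
To illustrate what I mean by "by design", here is a minimal sketch of how I understand the bucket keys to be derived. It assumes the rounding that the Elasticsearch documentation gives for histogram buckets, floor((value - offset) / interval) * interval + offset, applied to epoch milliseconds. `BucketKeySketch` and `BucketKey` are hypothetical names of my own and are not part of NEST or Elasticsearch.

```csharp
using System;

public static class BucketKeySketch
{
    // Hypothetical helper (not a NEST API): returns the start of the
    // fixed-width bucket that contains timestampMs, i.e.
    // floor((timestampMs - offsetMs) / intervalMs) * intervalMs + offsetMs.
    // Assumes intervalMs > 0.
    public static long BucketKey(long timestampMs, long intervalMs, long offsetMs = 0)
    {
        long shifted = timestampMs - offsetMs;

        // C# integer division truncates toward zero, so step down by one
        // when the remainder is negative to get a true floor.
        long quotient = shifted / intervalMs;
        if (shifted % intervalMs < 0)
        {
            quotient--;
        }

        return quotient * intervalMs + offsetMs;
    }
}
```

With a 60-second interval and no offset, a document at 107000 ms (00:01:47) falls into the bucket whose key is 60000 ms (00:01:00). The key is the start of the bucket, so the last key is earlier than the last document's timestamp unless that document sits exactly on a bucket boundary.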