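I am trying to issue a single query to Elasticsearch that filters, groups, aggregates, and sorts. I have two questions: what should the query look like, and what is its performance impact on Elasticsearch?
Let me give a sample dataset to support my question. Suppose I have a set of sales: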
document type: 'sales' with the following fields and data:
sale_datetime | sold_product | sold_at_price
-----------------|---------------|--------------
2015-11-24 12:00 | some product | 100
2015-11-24 12:30 | some product | 100
2015-11-24 12:30 | other product | 100
2015-11-24 13:00 | other product | 100
2015-11-24 12:30 | some product | 200
2015-11-24 13:00 | some product | 200
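I would like to issue a query that performs all four steps (filter, group, aggregate, sort).
Applied to the sample dataset above, it would return the following result: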
sold_product | sum of sold_at_price
--------------|--------------
some product | 300 // takes into account rows 2 and 5
other product | 100 // takes into account row 3
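If such a query can be issued, what are the important performance implications for Elasticsearch? In case it needs to be taken into account:
Thanks in advance for your help!
Answer 0 (score: 1)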
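This is a typical use case for aggregations. Let's start by creating the index and modeling the data mapping. We have a normal date field for sold_datetime, a numeric field for sold_at_price, and a multi-field of type string for sold_product. You'll notice that this multi-field has a not_analyzed sub-field named raw, which will be used to create the aggregation on product names: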
curl -XPUT localhost:9200/sales -d '{
  "mappings": {
    "sale": {
      "properties": {
        "sold_datetime": {
          "type": "date"
        },
        "sold_product": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "sold_at_price": {
          "type": "double"
        }
      }
    }
  }
}'
Now let's index your sample dataset using the _bulk endpoint of the new index:
curl -XPOST localhost:9200/sales/sale/_bulk -d '
{"index": {}}
{"sold_datetime": "2015-11-24T12:00:00.000Z", "sold_product":"some product", "sold_at_price": 100}
{"index": {}}
{"sold_datetime": "2015-11-24T12:30:00.000Z", "sold_product":"some product", "sold_at_price": 100}
{"index": {}}
{"sold_datetime": "2015-11-24T12:30:00.000Z", "sold_product":"other product", "sold_at_price": 100}
{"index": {}}
{"sold_datetime": "2015-11-24T13:00:00.000Z", "sold_product":"other product", "sold_at_price": 100}
{"index": {}}
{"sold_datetime": "2015-11-24T12:30:00.000Z", "sold_product":"some product", "sold_at_price": 200}
{"index": {}}
{"sold_datetime": "2015-11-24T13:00:00.000Z", "sold_product":"some product", "sold_at_price": 200}
'
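If the sales come from application code rather than a hand-written file, the bulk body can be generated. The following is a minimal Python sketch, assuming the same six sample rows; the names ROWS and build_bulk_body are made up for illustration and are not part of any Elasticsearch client. It only builds the newline-delimited body and does not send it anywhere.

```python
import json

# Sample rows from the question: (sold_datetime, sold_product, sold_at_price).
ROWS = [
    ("2015-11-24T12:00:00.000Z", "some product", 100),
    ("2015-11-24T12:30:00.000Z", "some product", 100),
    ("2015-11-24T12:30:00.000Z", "other product", 100),
    ("2015-11-24T13:00:00.000Z", "other product", 100),
    ("2015-11-24T12:30:00.000Z", "some product", 200),
    ("2015-11-24T13:00:00.000Z", "some product", 200),
]


def build_bulk_body(rows):
    """Return a newline-delimited body: one action line and one source line per row."""
    lines = []
    for sold_datetime, sold_product, sold_at_price in rows:
        # Empty action metadata: index and type are taken from the request URL.
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps({
            "sold_datetime": sold_datetime,
            "sold_product": sold_product,
            "sold_at_price": sold_at_price,
        }))
    # The bulk endpoint expects the body to end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = build_bulk_body(ROWS)
print(bulk_body, end="")
```

The resulting string can be saved to a file and posted with curl using --data-binary, which preserves the newlines.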
Finally, let's create the query and the aggregation you need:
curl -XPOST localhost:9200/sales/sale/_search -d '{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "sold_datetime": {
            "gt": "2015-11-24T12:15:00.000Z",
            "lt": "2015-11-24T12:45:00.000Z"
          }
        }
      }
    }
  },
  "aggs": {
    "sold_products": {
      "terms": {
        "field": "sold_product.raw",
        "order": {
          "total": "desc"
        }
      },
      "aggs": {
        "total": {
          "sum": {
            "field": "sold_at_price"
          }
        }
      }
    }
  }
}'
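As you can see, we filter on the sold_datetime field for a specific time interval (November 24, 12:15 to 12:45). The aggregation part defines a terms aggregation on the sold_product.raw field, and for each bucket we sum the values of the sold_at_price field. The order clause sorts the buckets by that total in descending order.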
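To make the semantics of the query concrete, here is a rough plain-Python equivalent of what it computes on the sample data. This is only an illustration of the filter, group, sum, and sort steps, not how Elasticsearch executes them; SALES and sum_by_product are names made up for this sketch.

```python
from collections import defaultdict

# Sample rows from the question: (sold_datetime, sold_product, sold_at_price).
SALES = [
    ("2015-11-24T12:00:00.000Z", "some product", 100),
    ("2015-11-24T12:30:00.000Z", "some product", 100),
    ("2015-11-24T12:30:00.000Z", "other product", 100),
    ("2015-11-24T13:00:00.000Z", "other product", 100),
    ("2015-11-24T12:30:00.000Z", "some product", 200),
    ("2015-11-24T13:00:00.000Z", "some product", 200),
]


def sum_by_product(sales, gt, lt):
    """Filter by an exclusive time range, sum prices per product, sort by sum descending."""
    totals = defaultdict(float)
    for sold_datetime, sold_product, sold_at_price in sales:
        # Timestamps in the same ISO-8601 format and time zone compare correctly as strings.
        if gt < sold_datetime < lt:
            totals[sold_product] += sold_at_price
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


result = sum_by_product(SALES, "2015-11-24T12:15:00.000Z", "2015-11-24T12:45:00.000Z")
print(result)  # [('some product', 300.0), ('other product', 100.0)]
```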
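Note that if you have millions of documents that may match, then to get good performance you should apply the most aggressive filter first, perhaps the ID of the business for which you're running the query, or any other criterion that excludes as many documents as possible before the aggregations run.
The results look like this: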
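As a sketch of that advice, the range filter can be combined with a more selective term filter inside a bool filter. The business_id field below is purely hypothetical (it is not in the sample mapping), and build_search_body is a helper name invented for this example. The Python code only assembles the request body in the same filtered-query syntax used above; it makes no claim about the order in which Elasticsearch evaluates the clauses internally.

```python
import json


def build_search_body(business_id, gt, lt):
    """Build the search body with an extra, hypothetical business_id filter."""
    return {
        "size": 0,
        "query": {
            "filtered": {
                "filter": {
                    "bool": {
                        "must": [
                            # Hypothetical field: narrows the set to a single business.
                            {"term": {"business_id": business_id}},
                            # Same time window as in the query above.
                            {"range": {"sold_datetime": {"gt": gt, "lt": lt}}},
                        ]
                    }
                }
            }
        },
        "aggs": {
            "sold_products": {
                "terms": {
                    "field": "sold_product.raw",
                    "order": {"total": "desc"},
                },
                "aggs": {
                    "total": {"sum": {"field": "sold_at_price"}}
                },
            }
        },
    }


body = build_search_body(42, "2015-11-24T12:15:00.000Z", "2015-11-24T12:45:00.000Z")
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to curl with -d in place of the inline body.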
{
  ...
  "aggregations" : {
    "sold_products" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "some product",
        "doc_count" : 2,
        "total" : {
          "value" : 300.0
        }
      }, {
        "key" : "other product",
        "doc_count" : 1,
        "total" : {
          "value" : 100.0
        }
      } ]
    }
  }
}
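To turn that response back into the two-column table from the question, the buckets can be flattened on the client side. This is a minimal Python sketch, assuming the response has already been parsed into a dict; RESPONSE reproduces only the aggregations part shown above, and buckets_to_rows is a name chosen for this example.

```python
# Only the "aggregations" part of the response shown above.
RESPONSE = {
    "aggregations": {
        "sold_products": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "some product", "doc_count": 2, "total": {"value": 300.0}},
                {"key": "other product", "doc_count": 1, "total": {"value": 100.0}},
            ],
        }
    }
}


def buckets_to_rows(response, agg_name="sold_products", metric="total"):
    """Return (product, sum) pairs in the order the buckets were returned."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [(bucket["key"], bucket[metric]["value"]) for bucket in buckets]


rows = buckets_to_rows(RESPONSE)
for product, total in rows:
    print(f"{product:<14}| {total}")
```

The buckets already arrive sorted by total in descending order, so no extra sorting is needed here.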