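I'm looking for the best way to group data in Elasticsearch. Elasticsearch doesn't support anything like SQL's 'group by'.
Say I have 1k categories and millions of products. What do you think is the best way to render a complete category tree? Of course you need some metadata (icon, link target, SEO titles, ...) and custom sorting for the categories.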
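1. Using aggregations. Example: https://found.no/play/gist/8124563 This looks usable if you have to group by one field and need some additional fields.
2. Using multiple fields in a facet (won't work). Example: https://found.no/play/gist/1aa44e2114975384a7c2 Here we lose the relationship between the different fields.
For example, building a category tree using these 3 "solutions" sucks. Solution 1 may work (ES 1 isn't stable right now). Solution 2 doesn't work. Solution 3 is a pain because it feels ugly, you need to prepare a lot of data, and the facets blow up.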
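A possible alternative is to store no category data in ES at all, just the id: https://found.no/play/gist/a53e46c91e2bf077f2e1
Then you could get the associated category from another system, such as Redis, memcache, or the database.
This would result in clean code, but performance could become a problem. For example, loading 1k categories from memcache / Redis / a database could be slow. Another problem is that syncing 2 databases is harder than syncing one.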
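To make the "ids only" idea concrete, here is a minimal sketch in Python. It assumes the category metadata has already been loaded into a plain dict (from Redis, memcache, or a database) and that the buckets come from a terms aggregation on a category id field. CATEGORY_META, attach_metadata, and the sample values are all made up for illustration.

```python
from operator import itemgetter

# Hypothetical metadata store; in practice this might be loaded once from
# Redis, memcache, or a database and cached in the application process.
CATEGORY_META = {
    10: {"title": "Shoes", "icon": "shoe.png", "sort": 2},
    20: {"title": "Hats", "icon": "hat.png", "sort": 1},
}


def attach_metadata(buckets, meta):
    """Join terms-aggregation buckets (keyed by category id) with metadata."""
    rows = []
    for bucket in buckets:
        info = meta.get(bucket["key"])
        if info is None:
            # The id is in the index but not in the metadata store: skip it.
            continue
        rows.append({"id": bucket["key"], "count": bucket["doc_count"], **info})
    # Apply the custom category ordering that is kept outside Elasticsearch.
    return sorted(rows, key=itemgetter("sort"))


# Buckets shaped like the "buckets" list of a terms aggregation response.
buckets = [
    {"key": 10, "doc_count": 1200},
    {"key": 20, "doc_count": 340},
    {"key": 30, "doc_count": 5},
]
rows = attach_metadata(buckets, CATEGORY_META)
```

With 1k categories the whole metadata dict is small enough to keep in memory, which may soften the performance concern; keeping it in sync with the index remains the harder part.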
How do you deal with such problems?
I'm sorry for the links, but I can't post more than 2 in one post.
Answer 0 (score: 23)
The aggregations API allows grouping by multiple fields using sub-aggregations. Suppose you want to group by the fields field1, field2 and field3:
{
    "aggs": {
        "agg1": {
            "terms": {
                "field": "field1"
            },
            "aggs": {
                "agg2": {
                    "terms": {
                        "field": "field2"
                    },
                    "aggs": {
                        "agg3": {
                            "terms": {
                                "field": "field3"
                            }
                        }
                    }
                }
            }
        }
    }
}
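Of course, this can go on for as many fields as you like.
Update
For completeness, here is what the output of the above query looks like. Further below is Python code that generates the aggregation query and flattens the result into a list of dictionaries.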
{
    "aggregations": {
        "agg1": {
            "buckets": [
                {
                    "doc_count": <count>,
                    "key": <value of field1>,
                    "agg2": {
                        "buckets": [
                            {
                                "doc_count": <count>,
                                "key": <value of field2>,
                                "agg3": {
                                    "buckets": [
                                        {
                                            "doc_count": <count>,
                                            "key": <value of field3>
                                        },
                                        {
                                            "doc_count": <count>,
                                            "key": <value of field3>
                                        }, ...
                                    ]
                                }
                            },
                            {
                                "doc_count": <count>,
                                "key": <value of field2>,
                                "agg3": {
                                    "buckets": [
                                        {
                                            "doc_count": <count>,
                                            "key": <value of field3>
                                        },
                                        {
                                            "doc_count": <count>,
                                            "key": <value of field3>
                                        }, ...
                                    ]
                                }
                            }, ...
                        ]
                    }
                },
                {
                    "doc_count": <count>,
                    "key": <value of field1>,
                    "agg2": {
                        "buckets": [
                            {
                                "doc_count": <count>,
                                "key": <value of field2>,
                                "agg3": {
                                    "buckets": [
                                        {
                                            "doc_count": <count>,
                                            "key": <value of field3>
                                        },
                                        {
                                            "doc_count": <count>,
                                            "key": <value of field3>
                                        }, ...
                                    ]
                                }
                            },
                            {
                                "doc_count": <count>,
                                "key": <value of field2>,
                                "agg3": {
                                    "buckets": [
                                        {
                                            "doc_count": <count>,
                                            "key": <value of field3>
                                        },
                                        {
                                            "doc_count": <count>,
                                            "key": <value of field3>
                                        }, ...
                                    ]
                                }
                            }, ...
                        ]
                    }
                }, ...
            ]
        }
    }
}
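The following Python code performs the group-by given a list of fields. If you specify include_missing=True, it also includes combinations of values where some of the fields are missing (you do not need it if you run Elasticsearch 2.0, thanks to a feature the original answer linked to).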
def group_by(es, fields, include_missing):
    # Build one nested terms aggregation per field, outermost field first.
    current_level_terms = {'terms': {'field': fields[0]}}
    agg_spec = {fields[0]: current_level_terms}

    if include_missing:
        # A sibling "missing" aggregation catches documents without the field.
        current_level_missing = {'missing': {'field': fields[0]}}
        agg_spec[fields[0] + '_missing'] = current_level_missing

    for field in fields[1:]:
        next_level_terms = {'terms': {'field': field}}
        current_level_terms['aggs'] = {
            field: next_level_terms,
        }

        if include_missing:
            next_level_missing = {'missing': {'field': field}}
            current_level_terms['aggs'][field + '_missing'] = next_level_missing
            # The missing bucket needs the same sub-aggregations as the terms one.
            current_level_missing['aggs'] = {
                field: next_level_terms,
                field + '_missing': next_level_missing,
            }
            current_level_missing = next_level_missing

        current_level_terms = next_level_terms

    agg_result = es.search(body={'aggs': agg_spec})['aggregations']
    return get_docs_from_agg_result(agg_result, fields, include_missing)


def get_docs_from_agg_result(agg_result, fields, include_missing):
    current_field = fields[0]
    buckets = agg_result[current_field]['buckets']
    if include_missing:
        # The missing bucket has no 'key', so its value comes out as None.
        buckets.append(agg_result[(current_field + '_missing')])

    if len(fields) == 1:
        # Innermost level: emit one record per non-empty bucket.
        return [
            {
                current_field: bucket.get('key'),
                'doc_count': bucket['doc_count'],
            }
            for bucket in buckets if bucket['doc_count'] > 0
        ]

    result = []
    for bucket in buckets:
        # Recurse into the sub-aggregations, then tag each record with this level's value.
        records = get_docs_from_agg_result(bucket, fields[1:], include_missing)
        value = bucket.get('key')
        for record in records:
            record[current_field] = value
        result.extend(records)

    return result
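Since the question is about rendering a category tree, the flat records can be folded back into nested dictionaries. This is a minimal sketch, assuming records shaped like the output of get_docs_from_agg_result above; build_tree is a hypothetical helper name, not part of any library.

```python
def build_tree(records, fields):
    """Fold flat group-by records into nested dicts keyed by field values."""
    tree = {}
    for record in records:
        node = tree
        # Walk down one level per field, except for the last field.
        for field in fields[:-1]:
            node = node.setdefault(record.get(field), {})
        # The innermost level stores the document count.
        node[record.get(fields[-1])] = record["doc_count"]
    return tree


# Sample records in the shape returned by a two-field group-by.
records = [
    {"field1": "a", "field2": "x", "doc_count": 3},
    {"field1": "a", "field2": "y", "doc_count": 1},
    {"field1": "b", "field2": "x", "doc_count": 2},
]
tree = build_tree(records, ["field1", "field2"])
```

Records from a missing bucket carry None as the field value, so they end up under a None key in the tree.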
Answer 1 (score: 4)
I think some developers will be looking for the same implementation with Spring Data ES and the Java ES API.
Please find it below:
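The required imports come first, followed by the code:
import static org.elasticsearch.index.query.QueryBuilders.matchAllQuery;
import static org.elasticsearch.search.aggregations.AggregationBuilders.terms;

import java.util.List;

import com.google.common.collect.Lists; // assuming Guava's Lists
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.Aggregations;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.springframework.data.elasticsearch.core.ResultsExtractor;
import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;
import org.springframework.data.elasticsearch.core.query.SearchQuery;

// indexName, type, elasticsearchTemplate and the FieldObject class are defined elsewhere.
List<FieldObject> fieldObjectList = Lists.newArrayList();
SearchQuery aSearchQuery = new NativeSearchQueryBuilder().withQuery(matchAllQuery()).withIndices(indexName).withTypes(type)
        .addAggregation(
                terms("ByField1").field("field1").subAggregation(AggregationBuilders.terms("ByField2").field("field2")
                        .subAggregation(AggregationBuilders.terms("ByField3").field("field3")))
        )
        .build();
Aggregations aField1Aggregations = elasticsearchTemplate.query(aSearchQuery, new ResultsExtractor<Aggregations>() {
    @Override
    public Aggregations extract(SearchResponse aResponse) {
        return aResponse.getAggregations();
    }
});
// Walk the three nested terms aggregations and flatten them into FieldObject rows.
// Note: on Elasticsearch 2.x and later getKey() returns Object; use getKeyAsString() there.
Terms aField1Terms = aField1Aggregations.get("ByField1");
aField1Terms.getBuckets().stream().forEach(aField1Bucket -> {
    String field1Value = aField1Bucket.getKey();
    Terms aField2Terms = aField1Bucket.getAggregations().get("ByField2");
    aField2Terms.getBuckets().stream().forEach(aField2Bucket -> {
        String field2Value = aField2Bucket.getKey();
        Terms aField3Terms = aField2Bucket.getAggregations().get("ByField3");
        aField3Terms.getBuckets().stream().forEach(aField3Bucket -> {
            String field3Value = aField3Bucket.getKey();
            Long count = aField3Bucket.getDocCount();
            FieldObject fieldObject = new FieldObject();
            fieldObject.setField1(field1Value);
            fieldObject.setField2(field2Value);
            fieldObject.setField3(field3Value);
            fieldObject.setCount(count);
            fieldObjectList.add(fieldObject);
        });
    });
});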
Answer 2 (score: 2)
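You can use a composite aggregation query as follows. This type of query also paginates the results if the number of buckets exceeds the normal ES limit. By using the "after" field you can access the rest of the buckets:
{
    "aggs": {
        "my_buckets": {
            "composite": {
                "sources": [
                    {
                        "field1": {
                            "terms": {
                                "field": "field1"
                            }
                        }
                    },
                    {
                        "field2": {
                            "terms": {
                                "field": "field2"
                            }
                        }
                    },
                    {
                        "field3": {
                            "terms": {
                                "field": "field3"
                            }
                        }
                    }
                ]
            }
        }
    }
}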
You can find more details on the ES documentation page bucket-composite-aggregation.
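A rough sketch of the paging loop in Python. It assumes a `search` callable that takes a request body and returns the response as a dict (for example a thin wrapper around es.search). iter_composite_buckets and fake_search are hypothetical names used only for illustration. The response of a composite aggregation carries an after_key, which is passed back as "after" in the next request.

```python
def iter_composite_buckets(search, fields, page_size=100):
    """Yield every bucket of a composite aggregation, one page at a time."""
    sources = [{field: {"terms": {"field": field}}} for field in fields]
    after = None
    while True:
        composite = {"size": page_size, "sources": sources}
        if after is not None:
            # Resume right after the last bucket of the previous page.
            composite["after"] = after
        body = {"size": 0, "aggs": {"my_buckets": {"composite": composite}}}
        agg = search(body)["aggregations"]["my_buckets"]
        if not agg["buckets"]:
            return
        yield from agg["buckets"]
        after = agg.get("after_key")
        if after is None:
            return


def fake_search(body):
    """Stand-in for a real client call, serving five buckets from memory."""
    data = [{"key": {"field1": k}, "doc_count": 1} for k in "abcde"]
    composite = body["aggs"]["my_buckets"]["composite"]
    start = 0
    if "after" in composite:
        start = [b["key"] for b in data].index(composite["after"]) + 1
    page = data[start:start + composite["size"]]
    result = {"buckets": page}
    if page:
        result["after_key"] = page[-1]["key"]
    return {"aggregations": {"my_buckets": result}}


keys = [b["key"]["field1"]
        for b in iter_composite_buckets(fake_search, ["field1"], page_size=2)]
```

Each bucket key is a dict with one entry per source, so the flattened "group by" row is already there without any recursion.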
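Answer 3 (score: 1)
Sub-aggregations are what you need. Although this is never stated explicitly in the docs, it can be found implicitly in structuring aggregations.
The result is that each sub-aggregation behaves as if the query had been filtered by the result of the higher-level aggregation. In fact, it looks as if that is exactly what happens there.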
{
    "aggregations": {
        "VALUE1AGG": {
            "terms": {
                "field": "VALUE1"
            },
            "aggregations": {
                "VALUE2AGG": {
                    "terms": {
                        "field": "VALUE2"
                    }
                }
            }
        }
    }
}
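To illustrate that reading, here is an in-memory sketch in plain Python. It is not Elasticsearch code and makes no claim about how Elasticsearch implements this internally. Counting VALUE2 inside each VALUE1 group gives the same numbers as filtering on one VALUE1 value first and then counting VALUE2. nested_terms and the sample documents are made up for the example.

```python
from collections import Counter, defaultdict


def nested_terms(docs, outer, inner):
    """Count inner-field values separately within each outer-field value."""
    groups = defaultdict(Counter)
    for doc in docs:
        groups[doc[outer]][doc[inner]] += 1
    return {key: dict(counts) for key, counts in groups.items()}


docs = [
    {"VALUE1": "a", "VALUE2": "x"},
    {"VALUE1": "a", "VALUE2": "x"},
    {"VALUE1": "a", "VALUE2": "y"},
    {"VALUE1": "b", "VALUE2": "x"},
]
result = nested_terms(docs, "VALUE1", "VALUE2")

# Filtering on VALUE1 == "a" first and then counting VALUE2 matches the "a" group.
filtered = dict(Counter(d["VALUE2"] for d in docs if d["VALUE1"] == "a"))
```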