elasticsearch group-by multiple fields

Time: 2013-12-25 17:03:54

Tags: group-by elasticsearch

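I am looking for the best way to group data in Elasticsearch. Elasticsearch does not support something like 'group by' in SQL.

Let's say I have 1k categories and millions of products. What do you think is the best way to render a complete category tree? Of course you need some metadata (icon, link target, SEO titles, ...) and custom sorting for the categories.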

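  1. Using aggregations. Example: https://found.no/play/gist/8124563 This looks usable if you have to group by one field and need some extra fields.

  2. Using multiple fields in a facet (won't work). Example: https://found.no/play/gist/1aa44e2114975384a7c2 Here we lose the relationship between the different fields.

  3. Building funny facets: https://found.no/play/gist/8124810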

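Building a category tree with these 3 "solutions" sucks. Solution 1 may work (ES 1 isn't stable right now). Solution 2 doesn't work. Solution 3 is a pain because it feels ugly, you need to prepare a lot of data, and the facets blow up.

An alternative might be not to store any category data in ES, just the id: https://found.no/play/gist/a53e46c91e2bf077f2e1

The associated category would then come from another system, like Redis, memcache, or the database.

This would end up in clean code, but performance might become a problem. For example, loading 1k categories from memcache / Redis / a database could be slow. Another problem is that syncing 2 databases is harder than syncing one.

How do you deal with such problems?

I am sorry for the links, but I can't post more than 2 in one post.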
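The id-only alternative described above can be sketched roughly as follows: a terms aggregation on the category id returns buckets, and the metadata is joined in from a second store. Everything here is an assumption for illustration: the `position` sort key, the metadata fields, and the in-memory dict standing in for Redis, memcache, or the database.

```python
def attach_category_metadata(buckets, metadata_by_id):
    """Join terms-aggregation buckets with category metadata by id."""
    categories = []
    for bucket in buckets:
        meta = metadata_by_id.get(bucket["key"])
        if meta is None:
            # Category is present in ES but not in the metadata store
            # (the two systems are out of sync), so it is skipped.
            continue
        categories.append(
            {"id": bucket["key"], "doc_count": bucket["doc_count"], **meta}
        )
    # Custom ordering comes from the metadata, not from Elasticsearch.
    return sorted(categories, key=lambda c: c["position"])


# Hypothetical buckets, shaped like the "buckets" list of a terms aggregation.
buckets = [
    {"key": 7, "doc_count": 1200},
    {"key": 3, "doc_count": 85},
    {"key": 9, "doc_count": 4},
]
# Hypothetical metadata, as it might be loaded in one batch from another store.
metadata_by_id = {
    3: {"title": "Shoes", "icon": "shoe.png", "position": 1},
    7: {"title": "Shirts", "icon": "shirt.png", "position": 2},
}

tree = attach_category_metadata(buckets, metadata_by_id)
print([c["title"] for c in tree])
```

Loading the metadata in one batch, rather than one lookup per category, is probably what decides whether this approach is fast enough.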

4 Answers:

Answer 0: (score: 23)

The aggregations API allows grouping by multiple fields using sub-aggregations. Suppose you want to group by the fields field1, field2, and field3:

{
  "aggs": {
    "agg1": {
      "terms": {
        "field": "field1"
      },
      "aggs": {
        "agg2": {
          "terms": {
            "field": "field2"
          },
          "aggs": {
            "agg3": {
              "terms": {
                "field": "field3"
              }
            }
          }          
        }
      }
    }
  }
}

Of course, this can go on for as many fields as you'd like.
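As a rough sketch, the nested body above can be generated for any list of fields. The helper name `nested_terms_aggs` and its `size` argument are illustrative and not part of the answer. `size` is included because a terms aggregation only returns the top buckets by default (10), so a complete group-by usually needs a larger value.

```python
import json


def nested_terms_aggs(fields, size=10):
    """Build a nested terms-aggregation body, one level per field."""
    aggs = {}
    # Walk the fields from innermost to outermost, wrapping as we go.
    for level, field in reversed(list(enumerate(fields, start=1))):
        node = {"terms": {"field": field, "size": size}}
        if aggs:
            node["aggs"] = aggs
        aggs = {"agg%d" % level: node}
    return {"aggs": aggs}


body = nested_terms_aggs(["field1", "field2", "field3"], size=100)
print(json.dumps(body, indent=2))
```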

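Update:
For completeness, here is what the output of the above query looks like. Below it is Python code that generates the aggregation query and flattens the result into a list of dictionaries.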

{
  "aggregations": {
    "agg1": {
      "buckets": [{
        "doc_count": <count>,
        "key": <value of field1>,
        "agg2": {
          "buckets": [{
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            }
          },
          {
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            }
          }, ...
          ]
        }
      },
      {
        "doc_count": <count>,
        "key": <value of field1>,
        "agg2": {
          "buckets": [{
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            }
          },
          {
            "doc_count": <count>,
            "key": <value of field2>,
            "agg3": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field3>
              },
              {
                "doc_count": <count>,
                "key": <value of field3>
              }, ...
              ]
            }
          }, ...
          ]
        }
      }, ...
      ]
    }
  }
}

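The following Python code performs the group-by given a list of fields. If you specify include_missing=True, it also includes combinations of values where some of the fields are missing (you don't need it if you have Elasticsearch 2.0).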

def group_by(es, fields, include_missing):
    """Group documents by `fields` using nested terms aggregations.

    With include_missing=True, a `missing` aggregation is added next to
    every terms aggregation so documents lacking a field are counted too.
    """
    current_level_terms = {'terms': {'field': fields[0]}}
    agg_spec = {fields[0]: current_level_terms}

    if include_missing:
        current_level_missing = {'missing': {'field': fields[0]}}
        agg_spec[fields[0] + '_missing'] = current_level_missing

    # Nest one level per remaining field, under both the terms branch and
    # the missing branch of the previous level.
    for field in fields[1:]:
        next_level_terms = {'terms': {'field': field}}
        current_level_terms['aggs'] = {
            field: next_level_terms,
        }

        if include_missing:
            next_level_missing = {'missing': {'field': field}}
            current_level_terms['aggs'][field + '_missing'] = next_level_missing
            current_level_missing['aggs'] = {
                field: next_level_terms,
                field + '_missing': next_level_missing,
            }
            current_level_missing = next_level_missing

        current_level_terms = next_level_terms

    # Run the query and flatten the nested buckets into flat records.
    agg_result = es.search(body={'aggs': agg_spec})['aggregations']
    return get_docs_from_agg_result(agg_result, fields, include_missing)


def get_docs_from_agg_result(agg_result, fields, include_missing):
    """Flatten nested aggregation buckets into a list of flat dictionaries."""
    current_field = fields[0]
    buckets = agg_result[current_field]['buckets']
    if include_missing:
        # The missing aggregation is a single bucket without a 'key'.
        buckets.append(agg_result[(current_field + '_missing')])

    # Innermost level: emit one record per non-empty bucket.
    if len(fields) == 1:
        return [
            {
                current_field: bucket.get('key'),  # None for the missing bucket
                'doc_count': bucket['doc_count'],
            }
            for bucket in buckets if bucket['doc_count'] > 0
        ]

    # Outer levels: recurse, then tag each record with this level's value.
    result = []
    for bucket in buckets:
        records = get_docs_from_agg_result(bucket, fields[1:], include_missing)
        value = bucket.get('key')
        for record in records:
            record[current_field] = value
        result.extend(records)

    return result
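
To see what the flattening produces without a running cluster, here is a minimal sketch that applies the same idea to a hand-written response. `flatten_buckets` is a simplified stand-in (no missing-value handling), and the `color` / `size` fields and their counts are made up for illustration.

```python
def flatten_buckets(agg_result, fields):
    """Yield one flat dict per innermost bucket of nested terms aggregations.

    Assumes each aggregation is named after the field it groups by.
    """
    field, rest = fields[0], fields[1:]
    for bucket in agg_result[field]["buckets"]:
        if not rest:
            yield {field: bucket["key"], "doc_count": bucket["doc_count"]}
        else:
            # Recurse into the sub-aggregation and tag each record
            # with the key of the enclosing bucket.
            for record in flatten_buckets(bucket, rest):
                yield {field: bucket["key"], **record}


# Hand-written fragment, shaped like the "aggregations" part of a response.
sample = {
    "color": {"buckets": [
        {"key": "red", "doc_count": 3, "size": {"buckets": [
            {"key": "L", "doc_count": 2},
            {"key": "M", "doc_count": 1},
        ]}},
        {"key": "blue", "doc_count": 1, "size": {"buckets": [
            {"key": "S", "doc_count": 1},
        ]}},
    ]},
}

rows = list(flatten_buckets(sample, ["color", "size"]))
for row in rows:
    print(row)
```

Each row corresponds to one combination of values, which is the shape a SQL GROUP BY over the same columns would return.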

Answer 1: (score: 4)

I think some developers will definitely be looking for the same implementation in Spring Data ES and the Java ES API.

Please find it below:

The implementation (the corresponding imports are required as well):

List<FieldObject> fieldObjectList = Lists.newArrayList();
    SearchQuery aSearchQuery = new NativeSearchQueryBuilder().withQuery(matchAllQuery()).withIndices(indexName).withTypes(type)
            .addAggregation(
                    terms("ByField1").field("field1").subAggregation(AggregationBuilders.terms("ByField2").field("field2")
                            .subAggregation(AggregationBuilders.terms("ByField3").field("field3")))
                    )
            .build();
    Aggregations aField1Aggregations = elasticsearchTemplate.query(aSearchQuery, new ResultsExtractor<Aggregations>() {
        @Override
        public Aggregations extract(SearchResponse aResponse) {
            return aResponse.getAggregations();
        }
    });
    Terms aField1Terms = aField1Aggregations.get("ByField1");
    aField1Terms.getBuckets().stream().forEach(aField1Bucket -> {
        String field1Value = aField1Bucket.getKey();
        Terms aField2Terms = aField1Bucket.getAggregations().get("ByField2");

        aField2Terms.getBuckets().stream().forEach(aField2Bucket -> {
            String field2Value = aField2Bucket.getKey();
            Terms aField3Terms = aField2Bucket.getAggregations().get("ByField3");

            aField3Terms.getBuckets().stream().forEach(aField3Bucket -> {
                String field3Value = aField3Bucket.getKey();
                Long count = aField3Bucket.getDocCount();

                FieldObject fieldObject = new FieldObject();
                fieldObject.setField1(field1Value);
                fieldObject.setField2(field2Value);
                fieldObject.setField3(field3Value);
                fieldObject.setCount(count);
                fieldObjectList.add(fieldObject);
            });
        });
    });

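Answer 2: (score: 2)

You can use a composite aggregation query as follows. This type of query also paginates the results if the number of buckets exceeds the normal limit of ES. By using the "after" field, you can access the rest of the buckets: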

"aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          {
            "field1": {
              "terms": {
                "field": "field1"
              }
            }
          },
          {
            "field2": {
              "terms": {
                "field": "field2"
              }
            }
          },
          {
            "field3": {
              "terms": {
                "field": "field3"
              }
            }
          }
        ]
      }
    }
  }

You can find more details in the ES documentation page bucket-composite-aggregation.
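
The paging described above can be sketched as a loop that feeds each response's `after_key` back into the next request as `after`. This is a minimal sketch under stated assumptions: `search` is a hypothetical callable that takes a request body and returns the response as a dict (for example a thin wrapper around a client), and `fake_search` only imitates a cluster so the loop can be run locally.

```python
def iter_composite_buckets(search, fields, page_size=1000):
    """Yield every bucket of a composite aggregation, page by page."""
    composite = {
        "size": page_size,
        "sources": [{f: {"terms": {"field": f}}} for f in fields],
    }
    body = {"size": 0, "aggs": {"my_buckets": {"composite": composite}}}
    while True:
        result = search(body)["aggregations"]["my_buckets"]
        yield from result["buckets"]
        after_key = result.get("after_key")
        if not result["buckets"] or after_key is None:
            break
        # Resume after the last bucket of the previous page.
        composite["after"] = after_key


def fake_search(body):
    """Stand-in for a cluster: serves buckets from a fixed, made-up list."""
    all_keys = [
        {"field1": "a", "field2": "x"},
        {"field1": "a", "field2": "y"},
        {"field1": "b", "field2": "x"},
    ]
    composite = body["aggs"]["my_buckets"]["composite"]
    start = all_keys.index(composite["after"]) + 1 if "after" in composite else 0
    page = all_keys[start:start + composite["size"]]
    result = {"buckets": [{"key": key, "doc_count": 1} for key in page]}
    if page:
        result["after_key"] = page[-1]
    return {"aggregations": {"my_buckets": result}}


buckets = list(iter_composite_buckets(fake_search, ["field1", "field2"], page_size=2))
print(len(buckets))
```

Passing the search function in as an argument keeps the paging logic independent of any particular client version.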

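Answer 3: (score: 1)

Sub-aggregations are what you need. Although this is never explicitly stated in the docs, it can be found implicitly under structuring aggregations.

It results in the sub-aggregation behaving as if the query were filtered by the result of the higher aggregation. In fact, it looks as if that is what happens in there.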

{
  "aggregations": {
    "VALUE1AGG": {
      "terms": {
        "field": "VALUE1"
      },
      "aggregations": {
        "VALUE2AGG": {
          "terms": {
            "field": "VALUE2"
          }
        }
      }
    }
  }
}
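
The "filtered by the parent bucket" intuition can be illustrated in plain Python, without Elasticsearch. This is only a model of the behaviour described above, not a claim about how Elasticsearch computes it internally, and the documents are made up.

```python
from collections import Counter

# Made-up documents with the two fields used in the query above.
docs = [
    {"VALUE1": "a", "VALUE2": "x"},
    {"VALUE1": "a", "VALUE2": "y"},
    {"VALUE1": "a", "VALUE2": "x"},
    {"VALUE1": "b", "VALUE2": "x"},
]


def terms(documents, field):
    """Count documents per distinct value of `field`, like a terms aggregation."""
    return Counter(d[field] for d in documents)


result = {}
# Outer aggregation: one bucket per distinct VALUE1.
for value1 in terms(docs, "VALUE1"):
    # Sub-aggregation: the same counting, restricted to this bucket's documents.
    in_bucket = [d for d in docs if d["VALUE1"] == value1]
    result[value1] = dict(terms(in_bucket, "VALUE2"))

print(result)
```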