Question

We are interested in working with high-cardinality indices. (This is a known problem for Elasticsearch.)

We already know that for

select count(distinct high_cardinality_field) from my_table

you already have some optimizations to compute it. Will it someday be possible to write something like this:

select count_via_hyperloglog(high_cardinality_field) from my_table

with count_via_hyperloglog as a UDF or something similar, as is now possible in ES through ES plugins?

Answer 1

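In Crate, this feature is on our backlog as an additional aggregation function that uses the HyperLogLog algorithm. We plan to derive the naming from Presto (http://prestodb.io/docs/current/functions/aggregate.html), so your example might then look like this: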

select approx_distinct(high_cardinality_field) from my_table
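
To make the idea behind such an approximate aggregation concrete, here is a minimal HyperLogLog sketch in Python. It is only an illustration of the general algorithm, not the implementation used by Crate or Presto; the class name `HyperLogLog`, the precision parameter `p`, and the choice of SHA-1 as the hash are assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog counter (hypothetical helper, not a Crate API)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        # Assumption: a 64-bit hash taken from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits choose the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant, valid for m >= 128
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(100_000):
        hll.add(f"user-{i}")
    print(round(hll.count()))  # an estimate near 100000, not an exact count
```

With `p = 12` the sketch keeps 4096 small registers regardless of how many values are added, which is where the memory saving over an exact `count(distinct ...)` comes from; the price is a relative error of roughly 1.04 / sqrt(4096), i.e. under 2%.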

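However, for one specific field per table, a possible performance gain comes from clustering the table on the high-cardinality field, as described at https://crate.io/docs/current/sql/ddl.html#routing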
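
One way to see why clustering on the field could help is sketched below. This is an assumption about the reasoning, not a description of Crate internals: if rows are routed to shards by a hash of the field, equal values always land on the same shard, so the global distinct count is simply the sum of the per-shard distinct counts and no cross-shard merge of value sets is needed. The functions `route` and `distinct_count_clustered` and the use of CRC32 as the routing hash are hypothetical.

```python
import zlib
from collections import defaultdict


def route(value: str, num_shards: int) -> int:
    """Pick a shard from a hash of the routing value (CRC32 is a stand-in hash)."""
    return zlib.crc32(value.encode("utf-8")) % num_shards


def distinct_count_clustered(values, num_shards: int = 4) -> int:
    """Count distinct values by summing independent per-shard distinct counts."""
    shards = defaultdict(set)
    for v in values:
        shards[route(v, num_shards)].add(v)
    # Equal values always share a shard, so the per-shard sets are disjoint
    # and their sizes can be added without deduplicating across shards.
    return sum(len(s) for s in shards.values())


if __name__ == "__main__":
    values = [f"u{i % 1000}" for i in range(5000)]
    print(distinct_count_clustered(values) == len(set(values)))
```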

Answer 2

High-cardinality counting using HyperLogLog is planned for 1.1.0, and the documentation is already online: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html

Example:

{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "field" : "author"
            }
        }
    }
}

For something like a UDF you can use scripts, e.g. by combining a filter aggregation with a script filter:

{
    "aggs": {
        "in_stock_products": {
            "filter": {
                "script": {
                    "script": "doc['price'].value > minPrice",
                    "params": {
                        "minPrice": 5
                    }
                }
            },
            "aggs": {
                "avg_price": {
                    "avg": {
                        "field": "price"
                    }
                }
            }
        }
    }
}
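
For readers unfamiliar with nested aggregations, the request above can be read as "keep only documents whose price exceeds minPrice, then average the price of what is left". The following Python sketch mirrors that logic on an in-memory list; it only illustrates the semantics and does not call Elasticsearch. The function `filtered_avg` and the sample documents are hypothetical.

```python
def filtered_avg(docs, min_price):
    """Return (matching doc count, average price of matches or None if none match)."""
    # Mirrors the script filter: doc['price'].value > minPrice
    prices = [d["price"] for d in docs if d["price"] > min_price]
    # Mirrors the nested avg aggregation over the filtered documents
    avg = sum(prices) / len(prices) if prices else None
    return len(prices), avg


if __name__ == "__main__":
    docs = [{"price": 3}, {"price": 8}, {"price": 12}]  # hypothetical documents
    print(filtered_avg(docs, min_price=5))
```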

UDFs or probabilistic data structures on the roadmap
