Question

I have a dataset structured as follows:

[uid, product,   currency,  platform,  date]
[100, product_1, USA,       desktop,   2019-01-01]
[100, product_2, USA,       desktop,   2019-01-03]
[200, product_3, CAN,       mobile,    2019-01-02]
[300, product_1, GBP,       desktop,   2019-01-01]
and so on...

The data must be aggregated by year:

[year, product,   currency, platform,  uid_count]
[2019, product_1, USA,      desktop,   1000]
[2019, product_2, USA,      desktop,   2000]
[2019, product_3, GBP,      mobile,    5000]

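While researching solutions I read about sketch algorithms, which seem like the right direction. Essentially, the data is too large to process in a single batch, so I need to process it incrementally (daily, for example) and avoid running SQL queries such as: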

SELECT year(date), product, currency, platform, count(distinct uid) FROM tbl_name GROUP BY 1, 2, 3, 4

OR

SELECT year(date), product, currency, platform, count(distinct uid) FROM tbl_name GROUP BY 1, 2, 3, 4
with cube

Answer 1

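Unfortunately, count(distinct uid) is not additive. You have to recount the whole yearly dataset; you cannot compute the distinct count for a single day and add it to the existing accumulated yearly count. If the same uid appears on several different days, then count(distinct uid) for day one plus count(distinct uid) for day two does not equal count(distinct uid) computed over both days together. This is what makes count(distinct) hard to scale.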
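To make the non-additivity concrete, here is a tiny illustration in Python. The uid values are made up for the example and are not taken from the asker's table.

```python
# Hypothetical daily uid sets; uid 100 is active on both days.
day1 = {100, 200, 300}
day2 = {100, 400}

# Adding the per-day distinct counts counts uid 100 twice.
sum_of_daily_counts = len(day1) + len(day2)

# The true distinct count needs the union of both days.
true_distinct = len(day1 | day2)

print(sum_of_daily_counts, true_distinct)
```

The two numbers differ by exactly the number of uids that appear on both days, which is information a plain daily count does not retain.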

If an approximation is acceptable, however, you can probably get an approximate estimate using sketch algorithms.

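A few implementations of sketch algorithms are available for Hive:

HyperLogLog for Hive: HllHiveUDFs
Sketches library from Yahoo
Brickhouse sketch UDFs: the "K-minimum values" sketching algorithm.
One more implementation: https://github.com/MLnick/hive-udf/wiki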
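The reason sketches help here is that, unlike a plain distinct count, a sketch can be merged: you can build one per day and combine them into a yearly sketch without rescanning the raw rows. Below is a minimal standalone Python sketch of the "K-minimum values" idea, not the Brickhouse or Yahoo implementation. The names `make_sketch`, `merge`, and `estimate`, the choice of SHA-1, and the sketch size `K = 1024` are all assumptions made for illustration.

```python
import hashlib

K = 1024  # sketch size (assumed); larger K gives a lower error

def hash01(uid):
    """Map a uid to a pseudo-random float in [0, 1)."""
    digest = hashlib.sha1(str(uid).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def make_sketch(uids, k=K):
    """Daily sketch: the k smallest distinct hash values seen that day."""
    return sorted({hash01(u) for u in uids})[:k]

def merge(a, b, k=K):
    """Combine two sketches; the result is the sketch of the union."""
    return sorted(set(a) | set(b))[:k]

def estimate(sketch, k=K):
    """Estimate the distinct count from the k-th smallest hash value."""
    if len(sketch) < k:
        return len(sketch)  # fewer than k distinct values: count is exact
    return (k - 1) / sketch[-1]

# Hypothetical data: two days with 2000 uids in common.
day1 = range(0, 6000)
day2 = range(4000, 10000)

yearly = merge(make_sketch(day1), make_sketch(day2))
est = estimate(yearly)
print(round(est))  # approximate; the true distinct count is 10000
```

Each daily sketch holds at most K numbers per (product, currency, platform) group, so only the sketches need to be stored and merged as new days arrive. The Hive UDF libraries listed above expose the same build, merge, and estimate steps under their own function names.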

How can I build yearly data from daily data?
