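I want to count accounts per week over the past 3 months, using the conditions specified in the query below. What is the most efficient way to get this data as a list of num_of_accounts and week?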
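SELECT COUNT(DISTINCT a.account_id) AS num_accounts,
WEEKOFYEAR(a.ds) AS week
FROM
(SELECT
CAST(account_id AS BIGINT) AS account_id, -- alias so the outer query can reference it
ds -- needed by WEEKOFYEAR(a.ds) above
FROM
tableA
WHERE ds='2013-12-28') a
JOIN
tableB b
ON a.account_id=b.account_id AND
b.ds='2013-12-28'
WHERE
b.invoice_date BETWEEN '2013-12-22' AND '2013-12-28' AND
-- a status cannot equal both values at once, so AND would match no rows
b.payment_status IN ('failed', 'unbilled')
GROUP BY WEEKOFYEAR(a.ds)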
Answer 0 (score: 1)
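You are trying to do a distinct count over a large set. One scalable approach is to use a probabilistic data structure, such as HyperLogLog or KMV sketch sets, like those provided in Brickhouse (http://github.com/klout/brickhouse). There is a blog post describing a situation much like yours: http://brickhouseconfessions.wordpress.com/2013/12/11/using-sketch_set-for-reach-estimation/. This should give you a fairly close estimate without requiring an exact count over your entire data set.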
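To make the idea of a KMV (K-Minimum-Values) sketch concrete, here is a minimal Python sketch of the general technique. It is an illustration only, not Brickhouse's implementation: the class name `KMVSketch`, the default `k`, and the use of MD5 as the hash are all assumptions chosen for the example.

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative K-Minimum-Values sketch (hypothetical, not Brickhouse code).

    Keeps only the k smallest hash values seen, so memory stays bounded
    no matter how many items are added.
    """

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # max-heap via negated values: the k smallest hashes
        self._members = set()  # hashes currently kept, used to skip duplicates

    @staticmethod
    def _hash(item):
        # Map the item to a float spread roughly uniformly over [0, 1].
        digest = hashlib.md5(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        h = self._hash(item)
        if h in self._members:
            return  # already tracked; duplicates do not change the sketch
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)  # fewer than k distinct hashes: count is exact
        kth_smallest = -self._heap[0]
        # If n distinct hashes are uniform on [0, 1], the k-th smallest is
        # about k / n, which gives the estimate n ~ (k - 1) / kth_smallest.
        return int((self.k - 1) / kth_smallest)
```

The relative error of this kind of estimator shrinks roughly as 1/sqrt(k), so a larger `k` trades memory for accuracy. Sketches built on separate partitions can also be merged by keeping the k smallest hashes of their union, which is what makes the approach practical in a distributed aggregation.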
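If I understand correctly, you just want to aggregate by week, and you already have the Hive UDF WEEKOFYEAR, which returns the week number from a date string. Just use Brickhouse's sketch_set UDAF: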
SELECT WEEKOFYEAR(ds) AS week, estimated_reach(sketch_set(account_id)) AS num_account_est
FROM myquery
GROUP BY WEEKOFYEAR(ds);
where myquery is a view that represents the business logic you expressed above.
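For checking the estimates against exact numbers on a small extract, the same per-week grouping can be sketched locally in Python. This assumes WEEKOFYEAR follows ISO-8601 week numbering; the function name `weekly_distinct_accounts` and the sample rows are hypothetical. The sketch keys on (year, week) rather than week alone, so that a date range crossing a year boundary does not merge weeks from different years.

```python
from collections import defaultdict
from datetime import date


def weekly_distinct_accounts(rows):
    """Exact distinct account count per ISO (year, week).

    rows: iterable of (ds, account_id) pairs, with ds as 'YYYY-MM-DD'.
    """
    accounts_by_week = defaultdict(set)
    for ds, account_id in rows:
        iso_year, iso_week, _ = date.fromisoformat(ds).isocalendar()
        accounts_by_week[(iso_year, iso_week)].add(account_id)
    return {week: len(ids) for week, ids in sorted(accounts_by_week.items())}


# Hypothetical sample data, for illustration only.
sample_rows = [
    ("2013-12-23", 1),
    ("2013-12-28", 1),  # same account, same week: counted once
    ("2013-12-28", 2),
    ("2013-12-30", 2),  # falls in ISO week 1 of 2014
    ("2014-01-02", 3),
]
weekly_counts = weekly_distinct_accounts(sample_rows)
```

An exact count like this holds every account id in memory, so it only suits small samples; the sketch-based query is the one meant to scale.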