I have a table (view_of_referred_events) that stores the number of visitors for a given page.
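I want to compute the 30-day rolling/moving sum for this product, including for the days that are missing from the table, so the end result should have a row for every day.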
Now, this is a simplified representation, because I have dozens of different product_id, country_id and referral values. I cannot pre-create a table with all possible combinations of {product_id, date, country_id, referral}, because given its size the table would become unmanageable. I also don't want a row in the final table for dates before a specific {product_id, country_id, referral} combination first exists.

I was wondering whether there is an easy way to tell Impala to use the value of the previous row (the previous day) if there are no visitors for that day in view_of_referred_events.

I wrote this query, where list_of_dates is a table listing the days from April 1st to April 7th.

select
    t.`date`,
    t.country_id,
    t.referral,
    t.product_id,
    sum(visitors) over (partition by t.country_id, t.referral, t.product_id order by t.`date`
                        rows between 30 preceding and current row) as cumulative_sum_visitors
from (
    select
        d.`date`,
        re.country_id,
        re.referral,
        re.product_id,
        sum(visitors) as visitors
    from list_of_dates d
    left outer join view_of_referred_events re on d.`date` = re.`date`
        and re.referral = "pl"
        and re.product_id = "113759"
        and re.country_id = "216"
    group by d.`date`, re.country_id, re.referral, re.product_id
) t
order by t.`date` asc;

This returns something similar to what I want, but not exactly that.

date        country_id  referral  product_id  cumulative_visitors
2016-04-01  216         pl        113759      1
2016-04-02  NULL        NULL      NULL        NULL
2016-04-03  216         pl        113759      2
2016-04-04  NULL        NULL      NULL        NULL
2016-04-05  NULL        NULL      NULL        NULL
2016-04-06  216         pl        113759      15
2016-04-07  216         pl        113759      25
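To pin down the intended result, here is a minimal Python sketch of a rolling sum computed by calendar date rather than by row count, so days with no data still get a row. The function name rolling_sum_by_day is a hypothetical helper, and the daily counts are the ones implied by the differences in the cumulative column above (1, 1, 13, 10); this is an illustration of the semantics, not Impala code.

```python
from datetime import date, timedelta


def rolling_sum_by_day(daily, start, end, window_days=30):
    """Return [(day, rolling_sum)] for every calendar day from start to end.

    daily maps a date to that day's visitor count; absent days count as 0.
    The window covers the current day and the window_days - 1 days before it.
    """
    result = []
    day = start
    while day <= end:
        total = sum(
            daily.get(day - timedelta(days=k), 0) for k in range(window_days)
        )
        result.append((day, total))
        day += timedelta(days=1)
    return result


# Daily counts implied by the cumulative column in the result above.
daily = {
    date(2016, 4, 1): 1,
    date(2016, 4, 3): 1,
    date(2016, 4, 6): 13,
    date(2016, 4, 7): 10,
}

# Start at the first day the key appears, so no rows precede it.
rows = rolling_sum_by_day(daily, min(daily), date(2016, 4, 7))
for day, total in rows:
    print(day.isoformat(), total)
```

With these inputs the missing days (April 2nd, 4th and 5th) carry the previous day's total instead of NULL, which is the behaviour being asked for.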
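Answer 0: (score: 0)
I'm not sure how bad the performance would be, but you can do this by aggregating the data twice: in the second copy, add 30 days to the date and negate the count.
Something like this: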
with t as (
    -- daily visitors for one key, with one row per day from list_of_dates
    select d.`date`, re.country_id, re.referral, re.product_id,
           sum(visitors) as visitors
    from list_of_dates d left outer join
         view_of_referred_events re
         on d.`date` = re.`date` and
            re.referral = 'pl' and
            re.product_id = 113759 and
            re.country_id = 216
    group by d.`date`, re.country_id, re.referral, re.product_id
)
select `date`, country_id, referral, product_id,
       sum(sum(visitors)) over (partition by country_id, referral, product_id order by `date`) as visitors
from ((select `date`, country_id, referral, product_id, visitors
       from t
      ) union all
      -- the same rows 30 days later with a negated count, so each value drops out of the running sum
      (select date_add(`date`, 30), country_id, referral, product_id, -visitors
       from t
      )
     ) tt
group by `date`, country_id, referral, product_id;
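The idea behind this query can be sketched outside SQL: each daily count is added on its own date and subtracted again 30 days later, so a plain running total behaves like a 30-day window. The Python below is only an illustration under that reading; rolling_sum_via_deltas is a hypothetical name, and the sample counts are the ones implied by the question's result table.

```python
from datetime import date, timedelta
from itertools import accumulate


def rolling_sum_via_deltas(daily, window_days=30):
    """Map each change point to the rolling sum that holds from that day on."""
    deltas = {}
    for day, visitors in daily.items():
        # +visitors on the day itself (first branch of the union all)
        deltas[day] = deltas.get(day, 0) + visitors
        # -visitors window_days later (second branch of the union all)
        expiry = day + timedelta(days=window_days)
        deltas[expiry] = deltas.get(expiry, 0) - visitors
    days = sorted(deltas)
    running = accumulate(deltas[d] for d in days)
    return dict(zip(days, running))


daily = {
    date(2016, 4, 1): 1,
    date(2016, 4, 3): 1,
    date(2016, 4, 6): 13,
    date(2016, 4, 7): 10,
}
sums = rolling_sum_via_deltas(daily)
for day in sorted(sums):
    print(day.isoformat(), sums[day])
```

Note that this sketch only produces entries on days where the total changes; any day in between keeps the value of the most recent earlier entry.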
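Answer 1: (score: 0)
I added another subquery to get the value from the last row of the partition. I'm not sure which version of Hive/Impala you are using; the syntax is last_value(column_name, ignore null values true/false).
I assume you are trying to find a cumulative count over 30 days (a month), so I would suggest grouping the rows by a month field. The month can be obtained from the dimension table list_of_dates or simply with substr(date, 1, 7), and the cumulative number of visitors can then be computed over ..rows unbounded preceding and current row.
Query: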
select
    `date`,
    country_id,
    referral,
    product_id,
    sum(visitors) over (partition by country_id, referral, product_id order by `date`
                        rows between 30 preceding and current row) as cumulative_sum_visitors
from (select
        t.`date`,
        -- get the last not null value from the partition window w for country_id, referral & product_id
        last_value(t.country_id, true) over w as country_id,
        last_value(t.referral, true) over w as referral,
        last_value(t.product_id, true) over w as product_id,
        -- "= null" never evaluates to true, so test with "is null"
        if(visitors is null, 0, visitors) as visitors
      from (
        select
            d.`date`,
            re.country_id,
            re.referral,
            re.product_id,
            sum(visitors) as visitors
        from list_of_dates d
        left outer join view_of_referred_events re on d.`date` = re.`date`
            and re.referral = "pl"
            and re.product_id = "113759"
            and re.country_id = "216"
        group by d.`date`, re.country_id, re.referral, re.product_id
      ) t
      window w as (partition by t.country_id, t.referral, t.product_id order by t.`date`
                   rows between unbounded preceding and unbounded following)) t1
order by `date` asc;
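The two steps this query relies on (filling the NULL key columns and treating NULL visitors as 0 before summing over a row-based window) can be sketched in Python as follows. This is a simplified illustration, not the query itself: fill_and_roll is a hypothetical helper, the key is carried forward from the most recent non-null row rather than taken from the end of the partition, and the daily counts are the ones implied by the question's result table.

```python
def fill_and_roll(rows, preceding=30):
    """rows: list of (day, key, visitors) sorted by day; key and visitors may be None.

    Returns (day, key, rolling_sum) with keys carried forward and missing
    counts treated as 0. The window is row-based: the current row plus up
    to `preceding` rows before it.
    """
    filled = []
    last_key = None
    for day, key, visitors in rows:
        if key is not None:
            last_key = key  # remember the most recent non-null key
        # "is None" mirrors SQL "is null"; an equality test against NULL never matches
        filled.append((day, last_key, 0 if visitors is None else visitors))

    out = []
    for i, (day, key, _) in enumerate(filled):
        window = filled[max(0, i - preceding): i + 1]
        out.append((day, key, sum(v for _, _, v in window)))
    return out


KEY = ("216", "pl", "113759")
rows = [
    ("2016-04-01", KEY, 1),
    ("2016-04-02", None, None),
    ("2016-04-03", KEY, 1),
    ("2016-04-04", None, None),
    ("2016-04-05", None, None),
    ("2016-04-06", KEY, 13),
    ("2016-04-07", KEY, 10),
]
result = fill_and_roll(rows)
for row in result:
    print(row)
```

A window of 30 preceding rows plus the current row spans 31 rows, so with one row per day it covers 31 days; passing preceding=29 would give a 30-day window.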