30-day rolling/moving sum when some dates are missing

Time: 2016-04-26 10:36:58

Tags: sql hadoop hive window-functions impala

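I have a table (`view_of_referred_events`) that stores the number of visitors for a given page, with the columns `date`, `country_id`, `referral`, `product_id` and `visitors`.

I want to compute the 30-day rolling/moving sum for this product, including the days that are missing from the table, so that every date appears in the final result with its running total.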

This is a simplified representation, because I have dozens of different `country_id`, `referral` and `product_id` values. I cannot pre-create a table containing every possible combination of {`date`, `country_id`, `referral`, `product_id`}, because at that size the table would become unmanageable. I also do not want a row in the final table for a given {`country_id`, `referral`, `product_id`} combination if that combination did not exist before that date.

If there were no visitors on a given day, I was wondering whether there is a simple way to tell Impala to use the value from the previous row (the previous day).

I wrote this query, where `list_of_dates` is a table listing the days from April 1 to April 7.

select t.`date`, t.country_id, t.referral, t.product_id,
       sum(visitors) over (partition by t.country_id, t.referral, t.product_id
                           order by t.`date`
                           rows between 30 preceding and current row) as cumulative_sum_visitors
from (
      select d.`date`, re.country_id, re.referral, re.product_id,
             sum(visitors) as visitors
      from list_of_dates d
      left outer join view_of_referred_events re
           on d.`date` = re.`date`
          and re.referral = "pl"
          and re.product_id = "113759"
          and re.country_id = "216"
      group by d.`date`, re.country_id, re.referral, re.product_id
     ) t
order by t.`date` asc;

This returns something similar to what I want, but not exactly:

date        country_id  referral  product_id  cumulative_visitors
2016-04-01  216         pl        113759      1
2016-04-02  NULL        NULL      NULL        NULL
2016-04-03  216         pl        113759      2
2016-04-04  NULL        NULL      NULL        NULL
2016-04-05  NULL        NULL      NULL        NULL
2016-04-06  216         pl        113759      15
2016-04-07  216         pl        113759      25
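To make the target concrete, here is a minimal Python sketch of the result the question seems to be after: one row per calendar day, with days that have no visitors carrying the running total forward. The daily visitor counts (1, 1, 13, 10) are an assumption, back-derived from the cumulative values 1, 2, 15, 25 in the output above; the helper name `rolling_sum` is hypothetical and is not part of Impala or Hive.

```python
from datetime import date, timedelta

# Daily visitors for one (country_id, referral, product_id) combination.
# Assumption: values back-derived from the cumulative sums 1, 2, 15, 25.
visitors = {
    date(2016, 4, 1): 1,
    date(2016, 4, 3): 1,
    date(2016, 4, 6): 13,
    date(2016, 4, 7): 10,
}


def rolling_sum(daily, start, end, window_days=30):
    """Return {day: visitors summed over the window_days ending on day}."""
    result = {}
    day = start
    while day <= end:
        window_start = day - timedelta(days=window_days - 1)
        # Days absent from `daily` simply contribute nothing to the sum.
        result[day] = sum(v for d, v in daily.items() if window_start <= d <= day)
        day += timedelta(days=1)
    return result


totals = rolling_sum(visitors, date(2016, 4, 1), date(2016, 4, 7))
for day, total in totals.items():
    print(day, total)
```

Under these assumed inputs, April 2, 4 and 5 get the same total as the preceding day instead of NULL, which is the behaviour the question asks for.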

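2 answers:

Answer 0: (score: 0)

I'm not sure how bad the performance would be, but you can do this by aggregating the data twice: in the second copy, add 30 days to the date and negate the count, then take a running sum over both copies.

Something like this: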

with t as (
      select d.`date`, re.country_id, re.referral, re.product_id,
             sum(visitors) as visitors
      from list_of_dates d left outer join
           view_of_referred_events re
           on d.`date` = re.`date` and
              re.referral = 'pl' and
              re.product_id = 113759 and
              re.country_id = 216
      group by d.`date`, re.country_id, re.referral, re.product_id
     )
select `date`, country_id, referral, product_id,
       sum(sum(visitors)) over (partition by country_id, referral, product_id order by `date`) as visitors
from ((select `date`, country_id, referral, product_id, visitors
       from t
      ) union all
      -- the same rows shifted 30 days forward with a negated count,
      -- so each visit drops out of the running sum after 30 days
      (select date_add(`date`, 30) as `date`, country_id, referral, product_id, -visitors
       from t
      )
     ) tt
group by `date`, country_id, referral, product_id;
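The arithmetic behind this answer can be checked outside the database. The following is a minimal Python sketch, not the query itself: every visit is recorded once as `+visitors` on its own date and once as `-visitors` 30 days later, and a plain running sum over those deltas then equals the 30-day rolling sum. The input counts are assumed for illustration, and `rolling_sum_by_deltas` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import date, timedelta
from itertools import accumulate

# Assumed daily visitors for a single (country_id, referral, product_id).
visitors = {
    date(2016, 4, 1): 1,
    date(2016, 4, 3): 1,
    date(2016, 4, 6): 13,
    date(2016, 4, 7): 10,
}


def rolling_sum_by_deltas(daily, window_days=30):
    """Running sum over +v on day d and -v on day d + window_days."""
    deltas = defaultdict(int)
    for day, v in daily.items():
        deltas[day] += v                                  # first copy of the data
        deltas[day + timedelta(days=window_days)] -= v    # shifted, negated copy
    days = sorted(deltas)
    return dict(zip(days, accumulate(deltas[d] for d in days)))


result = rolling_sum_by_deltas(visitors)
for day, total in result.items():
    print(day, total)
```

With these inputs the running sum reaches 25 on 2016-04-07, falls to 24 on 2016-05-01 when the April 1 visit leaves the window, and returns to 0 on 2016-05-07. Note that this window covers 30 calendar days, whereas `rows between 30 preceding and current row` in the question covers 31 rows.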

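Answer 1: (score: 0)

I added another subquery to get the value from the last row of the partition. I'm not sure which version of Hive/Impala you are using; the syntax is `last_value(column_name, ignore null values true/false)`.

I assume you are trying to find the cumulative count over 30 days (a month), so I would suggest grouping the rows by a month field. The month can come from the dimension table `list_of_dates` or simply from `substr(date, 1, 7)`, and you can then get the cumulative number of visitors over `... rows unbounded preceding and current row`.

Query: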

select
  `date`,
  country_id,
  referral,
  product_id,
  sum(visitors) over (partition by country_id, referral, product_id order by `date`
                     rows between 30 preceding and current row) as cumulative_sum_visitors 
from (select
  t.`date`,
  -- get the last non-null value from window w for country_id, referral & product_id
  last_value(t.country_id, true) over w as country_id,
  last_value(t.referral, true) over w as referral,
  last_value(t.product_id, true) over w as product_id,
  -- "= null" never evaluates to true, so test with "is null"
  if(t.visitors is null, 0, t.visitors) as visitors
from (
  select
    d.`date`, 
    re.country_id, 
    re.referral, 
    re.product_id,
    sum(visitors) as visitors
  from list_of_dates d
  left outer join view_of_referred_events re on d.`date` = re.`date`
    and re.referral = "pl"
    and re.product_id = "113759"
    and re.country_id = "216"
  group by d.`date`, re.country_id, re.referral, re.product_id
  ) t
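-- no "partition by" here: rows for missing days have NULL keys, and partitioning
-- on those columns would isolate them from the values that should fill them
window w as (order by t.`date`
                     rows between unbounded preceding and unbounded following)) t1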
order by `date` asc;
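As a rough check of the idea in this answer, here is a minimal Python sketch of the two steps: carry the most recent non-NULL keys forward onto the rows for missing days and treat their missing visitor counts as 0, then sum each row with up to 30 preceding rows. It is a forward-fill variant rather than a translation of the query, the input rows are assumed to mirror the left-join output from the question, and `fill_and_roll` is a hypothetical helper name.

```python
# Rows as the left join would return them: (date, country_id, referral,
# product_id, visitors). Assumption: visitor counts are illustrative.
rows = [
    ("2016-04-01", 216, "pl", 113759, 1),
    ("2016-04-02", None, None, None, None),
    ("2016-04-03", 216, "pl", 113759, 1),
    ("2016-04-04", None, None, None, None),
    ("2016-04-05", None, None, None, None),
    ("2016-04-06", 216, "pl", 113759, 13),
    ("2016-04-07", 216, "pl", 113759, 10),
]


def fill_and_roll(rows, preceding=30):
    """Forward-fill the key columns, then sum over `preceding` rows + current."""
    filled = []
    last_keys = (None, None, None)
    for day, country_id, referral, product_id, visitors in rows:
        if country_id is not None:
            last_keys = (country_id, referral, product_id)
        # Missing days keep the previous keys and count as 0 visitors.
        filled.append((day, *last_keys, 0 if visitors is None else visitors))

    out = []
    for i, row in enumerate(filled):
        window = filled[max(0, i - preceding): i + 1]
        out.append((*row[:4], sum(r[4] for r in window)))
    return out


for row in fill_and_roll(rows):
    print(row)
```

Every row now carries the keys 216 / pl / 113759, and the cumulative column reads 1, 1, 2, 2, 2, 15, 25. The sketch handles a single key combination ordered by date; with several combinations in one result, the fill would have to be done separately for each of them.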