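I'm working with Hive and am stuck on a rolling count problem. The sample data I'm working with looks like this:
The output I expect looks like this:
I tried the following query, but it does not return a rolling count: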
select event_dt, status, count(distinct account)
from (select *,
             row_number() over (partition by account order by event_dt desc) as rnum
      from table.A
      where event_dt between '2018-05-02' and '2018-05-04'
     ) x
where rnum = 1
group by event_dt, status;
If anyone has solved a similar problem, please help.
Answer 0 (score: 0)
You seem to just want conditional aggregation:
select event_dt,
sum(case when status = 'Registered' then 1 else 0 end) as registered,
sum(case when status = 'active_acct' then 1 else 0 end) as active_acct,
sum(case when status = 'suspended' then 1 else 0 end) as suspended,
sum(case when status = 'reactive' then 1 else 0 end) as reactive
from table.A
group by event_dt
order by event_dt;
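To see what this conditional aggregation returns, here is a minimal sketch that runs the same query shape against SQLite through Python's sqlite3 module rather than Hive. The table name `a` and the sample rows are made up for illustration, since the original sample data is not shown. Note that it counts only the events recorded on each date; it does not carry an account's status forward to later dates.

```python
import sqlite3

# Hypothetical sample rows: (event_dt, account, status)
rows = [
    ("2018-05-02", "acct1", "Registered"),
    ("2018-05-02", "acct2", "Registered"),
    ("2018-05-03", "acct1", "active_acct"),
    ("2018-05-04", "acct2", "suspended"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table a (event_dt text, account text, status text)")
conn.executemany("insert into a values (?, ?, ?)", rows)

# One output column per status; each row adds 1 to the column it matches.
query = """
select event_dt,
       sum(case when status = 'Registered'  then 1 else 0 end) as registered,
       sum(case when status = 'active_acct' then 1 else 0 end) as active_acct,
       sum(case when status = 'suspended'   then 1 else 0 end) as suspended,
       sum(case when status = 'reactive'    then 1 else 0 end) as reactive
from a
group by event_dt
order by event_dt
"""
daily_counts = conn.execute(query).fetchall()
for row in daily_counts:
    print(row)
```

With these rows, acct1 is no longer counted on 2018-05-04 because it has no event that day, which is why a per-day aggregation alone is not a rolling count.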
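Edit:
This is a tricky problem. The solution I came up with takes the cross product of dates and accounts, and then calculates the most recent status as of each date.
So: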
select a.event_dt,
sum(case when aa.status = 'Registered' then 1 else 0 end) as registered,
sum(case when aa.status = 'active_acct' then 1 else 0 end) as active_acct,
sum(case when aa.status = 'suspended' then 1 else 0 end) as suspended,
sum(case when aa.status = 'reactive' then 1 else 0 end) as reactive
from (select d.event_dt, ac.account, a.status,
max(case when a.status is not null then a.timestamp end) over (partition by ac.account order by d.event_dt) as last_status_timestamp
from (select distinct event_dt from table.A) d cross join
(select distinct account from table.A) ac left join
(select a.*,
row_number() over (partition by account, event_dt order by timestamp desc) as seqnum
from table.A a
) a
on a.event_dt = d.event_dt and
a.account = ac.account and
a.seqnum = 1 -- get the last one on the date
) a left join
table.A aa
on aa.timestamp = a.last_status_timestamp and
aa.account = a.account
group by a.event_dt
order by a.event_dt;
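This creates a derived table with a row for every combination of account and date. The status is populated on some of those days, but not on all of them.
The cumulative max for last_status_timestamp calculates the most recent timestamp that has a valid status. This is then joined back to the table to get the status as of that date. Voila! That is the status used for the conditional aggregation.
The cumulative max and the join are a workaround, because Hive does not (yet?) support the ignore nulls option in lag().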
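The carry-forward idea can be tried out on a small data set. The following is a sketch only: it runs on SQLite (3.25 or later, for window functions) through Python's sqlite3 module instead of Hive. The table name `a`, the column name `ts` (standing in for the timestamp column), and the rows are all assumptions made for illustration. The query is a simplified variant of the one above, written with a common table expression.

```python
import sqlite3

# Hypothetical sample rows: (event_dt, account, status, ts)
rows = [
    ("2018-05-02", "acct1", "Registered",  "2018-05-02 10:00:00"),
    ("2018-05-03", "acct1", "active_acct", "2018-05-03 09:00:00"),
    ("2018-05-02", "acct2", "Registered",  "2018-05-02 11:00:00"),
    ("2018-05-04", "acct2", "suspended",   "2018-05-04 08:00:00"),
    ("2018-05-03", "acct3", "Registered",  "2018-05-03 08:00:00"),
    ("2018-05-03", "acct3", "active_acct", "2018-05-03 12:00:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table a (event_dt text, account text, status text, ts text)")
conn.executemany("insert into a values (?, ?, ?, ?)", rows)

query = """
with grid as (
    -- One row per (date, account); the running max carries the latest
    -- event timestamp forward across days that have no event.
    select d.event_dt, ac.account,
           max(l.ts) over (partition by ac.account order by d.event_dt) as last_ts
    from (select distinct event_dt from a) d
    cross join (select distinct account from a) ac
    left join (
        select account, event_dt, ts,
               row_number() over (partition by account, event_dt
                                  order by ts desc) as seqnum
        from a
    ) l
      on l.event_dt = d.event_dt
     and l.account = ac.account
     and l.seqnum = 1            -- keep only the last event on each date
)
select g.event_dt,
       sum(case when aa.status = 'Registered'  then 1 else 0 end) as registered,
       sum(case when aa.status = 'active_acct' then 1 else 0 end) as active_acct,
       sum(case when aa.status = 'suspended'   then 1 else 0 end) as suspended,
       sum(case when aa.status = 'reactive'    then 1 else 0 end) as reactive
from grid g
left join a aa                   -- look up the status at the carried timestamp
  on aa.ts = g.last_ts and aa.account = g.account
group by g.event_dt
order by g.event_dt
"""
rolling_counts = conn.execute(query).fetchall()
for row in rolling_counts:
    print(row)
```

In this made-up data, acct3 has no status on 2018-05-02, so it is not counted that day, and acct1 stays counted as active_acct on 2018-05-04 even though it has no event on that date. The join back on timestamp and account assumes that an account never has two rows with the same timestamp.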