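I ran into a limitation with column-level subqueries. Suppose I want a result like this:
rows of (date, store, transaction) built from a self-joined table.
I know this is possible in a traditional data warehouse using column-level subqueries, but Hive appears to lack that feature, so I wrote my own query: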
select main_table.date,main_table.store,main_table.transaction,yest_table.transaction as yesterday_trans, lw_table.transaction as lastweek_trans, lm_table.transaction as lastmonth_trans
from
(select date, store, transaction from table where date=current_date)main_table
left join
(select date, store, transaction from table where date=date_sub(current_date,1))yest_table
on date_sub(main_table.date,1)=yest_table.date and main_table.store=yest_table.store
left join
(select date, store, transaction from table where date=date_sub(current_date,7))lw_table
on date_sub(main_table.date,7)=lw_table.date and main_table.store=lw_table.store --join on lw_table, not yest_table
left join
(select date, store, transaction from table where date=add_months(current_date,-1))lm_table --one month back, not 7 days
on add_months(main_table.date,-1)=lm_table.date and main_table.store=lm_table.store --join on lm_table, not yest_table
Is this correct? I suspect there may be a better solution.
Thanks
Answer 0 (score: 1)
Use case + max() aggregation:
select main.date,main.store,main.transaction,s.yesterday_trans,s.lastweek_trans,s.lastmonth_trans
from
(select date, store, transaction from table where date=current_date)main
left join
(select store,
max(case when date = date_sub(current_date,1) then transaction end) yesterday_trans,
max(case when date = date_sub(current_date,7) then transaction end) lastweek_trans,
max(case when date = add_months(current_date,-1) then transaction end) lastmonth_trans
from table
where date>=add_months(current_date,-1) and date<=date_sub(current_date,1)
group by store
) s on main.store=s.store;
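This way you eliminate two unnecessary table scans and joins. This solution works only for current_date (or for a fixed parameter in place of current_date). If you want to select many dates from the main table, then your solution with three joins on date + store will be the most efficient.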
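To check the case + max() pattern on toy data, here is a minimal sketch that runs the same shape of query in SQLite from Python. Everything specific in it is an assumption for illustration: SQLite stands in for Hive, the names sales, dt, and txn replace table, date, and transaction, a fixed date parameter :d replaces current_date, and SQLite's date(:d, '-1 day') and date(:d, '-1 month') replace Hive's date_sub and add_months.

```python
import sqlite3

# Assumptions: SQLite stands in for Hive; `sales`, `dt`, and `txn` are
# hypothetical names replacing `table`, `date`, and `transaction`.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (dt TEXT, store TEXT, txn INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-03-15", "A", 100),  # "today" for store A
        ("2024-03-14", "A", 90),   # yesterday
        ("2024-03-08", "A", 70),   # one week ago
        ("2024-02-15", "A", 50),   # one month ago
        ("2024-03-15", "B", 200),  # store B has no row for yesterday or last month
        ("2024-03-08", "B", 170),
    ],
)

# One scan of the history, pivoted into columns with CASE + MAX,
# then a single join back to today's rows.
QUERY = """
SELECT m.dt, m.store, m.txn,
       s.yesterday_trans, s.lastweek_trans, s.lastmonth_trans
FROM (SELECT dt, store, txn FROM sales WHERE dt = :d) m
LEFT JOIN (
    SELECT store,
           MAX(CASE WHEN dt = date(:d, '-1 day')   THEN txn END) AS yesterday_trans,
           MAX(CASE WHEN dt = date(:d, '-7 days')  THEN txn END) AS lastweek_trans,
           MAX(CASE WHEN dt = date(:d, '-1 month') THEN txn END) AS lastmonth_trans
    FROM sales
    WHERE dt >= date(:d, '-1 month') AND dt <= date(:d, '-1 day')
    GROUP BY store
) s ON m.store = s.store
ORDER BY m.store
"""

rows = conn.execute(QUERY, {"d": "2024-03-15"}).fetchall()
for row in rows:
    print(row)
```

Store B comes back with NULL (None in Python) for yesterday and last month, which matches what the left joins in the question would return for missing dates.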
Hmm, quite possibly LAG is also an applicable solution...
select date,store,transaction,
case when lag(date,1) over(partition by store order by date) = date_sub(date,1) --check if LAG(1) is yesterday (previous date)
then lag(transaction,1) over(partition by store order by date)
end as yesterday_trans
...
--where date>=add_months(current_date,-1) and date<=date_sub(current_date,1)
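Add aggregation if necessary. If the LAG solution is applicable, it should be the fastest, because it needs no joins at all and does everything in a single scan. If there are many records per date, you can probably pre-aggregate them before applying LAG. This approach works for any date, not only current_date.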
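A rough sketch of the LAG variant with pre-aggregation, again run in SQLite from Python so it can be tried locally. The assumptions are the same as before and are not part of the original answer: SQLite (3.25 or later, for window functions) stands in for Hive, sales, dt, and txn are hypothetical names, date(dt, '-1 day') replaces date_sub(date,1), and SUM is only one possible choice of pre-aggregation.

```python
import sqlite3

# Assumptions: SQLite (3.25+ for window functions) stands in for Hive;
# `sales`, `dt`, and `txn` are hypothetical names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (dt TEXT, store TEXT, txn INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-03-12", "A", 40),
        ("2024-03-12", "A", 20),   # second record on the same day
        ("2024-03-13", "A", 80),
        ("2024-03-15", "A", 100),  # 2024-03-14 is missing for store A
    ],
)

QUERY = """
WITH daily AS (                      -- pre-aggregate to one row per store and day
    SELECT dt, store, SUM(txn) AS txn
    FROM sales
    GROUP BY dt, store
)
SELECT dt, store, txn,
       -- take the previous row only if it really is the previous calendar day
       CASE WHEN LAG(dt, 1) OVER (PARTITION BY store ORDER BY dt) = date(dt, '-1 day')
            THEN LAG(txn, 1) OVER (PARTITION BY store ORDER BY dt)
       END AS yesterday_trans
FROM daily
ORDER BY store, dt
"""

lag_rows = conn.execute(QUERY).fetchall()
for row in lag_rows:
    print(row)
```

The row for 2024-03-15 gets NULL for yesterday_trans because the preceding row is two days older, which is exactly what the date check inside the CASE guards against.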