Question

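I've run into a problem with column-level subqueries. Say I want a result like this:

built from a self-joined table

that contains (date, store and transaction).

I know this can be done in a traditional data warehouse (using column-level subqueries), but I found that Hive lacks this feature, so I wrote my own query: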

select main_table.date,main_table.store,main_table.transaction,yest_table.transaction as yesterday_trans, lw_table.transaction as lastweek_trans, lm_table.transaction as lastmonth_trans
    from
    (select date, store, transaction from table where date=current_date)main_table
    left join
    (select date, store, transaction from table where date=date_sub(current_date,1))yest_table
    on date_sub(main_table.date,1)=yest_table.date and main_table.store=yest_table.store
    left join
    (select date, store, transaction from table where date=date_sub(current_date,7))lw_table
    on date_sub(main_table.date,7)=lw_table.date and main_table.store=lw_table.store
    left join
    (select date, store, transaction from table where date=add_months(current_date,-1))lm_table
    on add_months(main_table.date,-1)=lm_table.date and main_table.store=lm_table.store

Is this right? I suspect there may be a better solution.

Thanks

Answer 1

Use case + max() aggregation:

select main.date,main.store,main.transaction,s.yesterday_trans,s.lastweek_trans,s.lastmonth_trans
    from
    (select date, store, transaction from table where date=current_date)main
    left join
    (select store, 
       max(case when date = date_sub(current_date,1)    then transaction end) yesterday_trans,  
       max(case when date = date_sub(current_date,7)    then transaction end) lastweek_trans,
       max(case when date = add_months(current_date,-1) then transaction end) lastmonth_trans
       from table 
      where date>=add_months(current_date,-1) and date<=date_sub(current_date,1)
      group by store
    ) s on main.store=s.store;

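This way you eliminate two unnecessary table scans and joins. This solution works only for current_date (or a fixed parameter in place of current_date). If you want to select many dates from the main table, then the solution with three joins on date + store will be the most efficient.

Hmm, quite possibly LAG is an applicable solution too...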
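For the many-dates case, a rough sketch of the three-join version could look like the query below. It is only an illustration under a few assumptions: `sales` is a hypothetical name standing in for the question's `table`, there is one row per store and date, and the 30-day range in the `where` clause is arbitrary. The backticks are there because `date` is a reserved word in recent Hive versions.

```sql
-- Sketch only: `sales` is a hypothetical table name, one row per store and date assumed.
select m.`date`, m.store, m.`transaction`,
       y.`transaction`  as yesterday_trans,
       lw.`transaction` as lastweek_trans,
       lm.`transaction` as lastmonth_trans
  from sales m
  -- each join picks the same store on a date shifted relative to the main row
  left join sales y  on y.store  = m.store and y.`date`  = date_sub(m.`date`, 1)
  left join sales lw on lw.store = m.store and lw.`date` = date_sub(m.`date`, 7)
  left join sales lm on lm.store = m.store and lm.`date` = add_months(m.`date`, -1)
 where m.`date` >= date_sub(current_date, 30);  -- arbitrary example range
```

Because the shifted date is computed from `m.date` rather than from current_date, every row in the selected range gets its own yesterday, last-week and last-month values.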

select date,store,transaction,
    case when lag(date,1) over(partition by store order by date) = date_sub(date,1) --check if LAG(1) is yesterday (previous date)
         then lag(transaction,1) over(partition by store order by date)
    end as yesterday_trans 
...
--where date>=add_months(current_date,-1) and date<=date_sub(current_date,1)

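Add aggregation if necessary. If the LAG solution is applicable, it will be the fastest, because no joins are needed at all and everything is done in a single scan. If there are many records per date, you can probably pre-aggregate them before applying LAG. This approach is not limited to current_date.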
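A minimal sketch of pre-aggregating before LAG, assuming the table may hold several rows per store and date. `sales` and `total_trans` are hypothetical names, and `sum` is just one possible aggregate; use whatever fits the data.

```sql
-- Sketch only: `sales`, `daily` and `total_trans` are hypothetical names.
select `date`, store, total_trans,
       -- take the previous row's value only if that row really is the previous day
       case when lag(`date`, 1) over (partition by store order by `date`) = date_sub(`date`, 1)
            then lag(total_trans, 1) over (partition by store order by `date`)
       end as yesterday_trans
  from (select `date`, store, sum(`transaction`) as total_trans  -- one row per store and date
          from sales
         where `date` >= add_months(current_date, -1)
         group by `date`, store) daily;
```

Note that a fixed offset such as lag(..., 7) lines up with last week only when no dates are missing for a store. With gaps, the date check returns NULL even if a row for that day exists, so the join or case + max() variants are safer for the last-week and last-month columns.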

Hive - Column level subquery workaround
