Hive: summarizing table data by date range

Time: 2019-05-16 21:03:26

Tags: sql date datetime hive hql

I have a table with the following schema, containing the data shown below:

ID HITS MISS DDATE
1 10    3    20180101
1 33    21   20180122
1 84    11   20180901
1 11    2    20180405
1 54    23   20190203
1 33    43   20190102
4 54    22   20170305
4 56    88   20180115
5 87    22   20180809
5 66    48   20180617
5 91    53   20170606

DataTypes:

ID    INT
HITS  INT 
MISS  INT
DDATE STRING

The requirement is to compute the totals of HITS and MISS for each ID per year, i.e. 2017, 2018, 2019 ...

I wrote the following query:

SELECT ID,
SUM(HITS) AS HITS,SUM(MISS) AS MISS,
CASE 
    WHEN DDATE BETWEEN '201701' AND '201712' THEN '2017' ELSE 
    'NOTHING' END AS TTL_YR17_DATA
CASE 
    WHEN DDATE BETWEEN '201801' AND '201812' THEN '2018' ELSE 
    'NOTHING' END AS TTL_YR18_DATA
CASE 
    WHEN DDATE BETWEEN '201901' AND '201912' THEN '2019' ELSE 
    'NOTHING' END AS TTL_YR19_DATA
FROM 
     HST_TABLE
WHERE 
     DDATE BETWEEN '201801' AND '201812'
GROUP BY 
     ID,DDATE; 

However, the query does not return the expected result.

Actual O/P:

1 10   3    2018
1 33   21   2018
1 84   11   2018
1 11   2    2018
1 54   23   2019
1 33   43   2019
4 54   22   2017
4 56   88   2018
5 87   22   2018
5 66   48   2018
5 91   53   2017

Expected O/P:

1   138 37  2018
4   56  88  2018
5   153 70  2018
1   87  66  2019
5   91  53  2017

Another related question:

Is there a way to avoid hard-coding the DDATE range in the query? The range should be supplied by the user.

Any help or suggestions that address both requirements would be appreciated.

2 answers:

Answer 0 (score: 3)

OK, this is easy to do with Hive's substring function, as follows:

select
    substring(dddate,0,4) as the_year,
    id,
    sum(hits) as hits_num,
    sum(miss) as miss_num
from
    hst_table
group by
    substring(dddate,0,4),
    id
order by
    the_year,
    id

Answer 1 (score: 1)

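The answer above by @Shawn.X takes the right approach but contains a mistake: it refers to the column as `dddate` rather than `ddate`. Here is the corrected version: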

select
    substring(ddate,0,4) as the_year,
    id,
    sum(hits) as hits_num,
    sum(miss) as miss_num
from
    hst_table
group by
    substring(ddate,0,4),
    id
order by
    the_year,
    id;
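
Neither query covers the follow-up about not hard-coding the DDATE range. One possible approach is Hive's variable substitution; the following is only a sketch under a few assumptions: substitution is enabled (it is by default), the variable names `start_dt` and `end_dt` and the script name `yearly_totals.hql` are hypothetical, and the caller passes dates in the same yyyyMMdd form as DDATE. It also uses `substr(ddate, 1, 4)`, since string positions in Hive are 1-based.

```sql
-- Hypothetical variable names. In practice the caller would supply them, e.g.
--   hive --hivevar start_dt=20180101 --hivevar end_dt=20181231 -f yearly_totals.hql
-- They are set inline here only so the sketch is self-contained.
SET hivevar:start_dt=20180101;
SET hivevar:end_dt=20181231;

SELECT
    substr(ddate, 1, 4) AS the_year,   -- first four characters of yyyyMMdd = year
    id,
    SUM(hits) AS hits_num,
    SUM(miss) AS miss_num
FROM
    hst_table
WHERE
    -- DDATE is a yyyyMMdd STRING, so a string comparison follows date order
    ddate BETWEEN '${hivevar:start_dt}' AND '${hivevar:end_dt}'
GROUP BY
    substr(ddate, 1, 4),
    id
ORDER BY
    the_year,
    id;
```

Grouping by the extracted year and ID, rather than by ID and the full DDATE, is what collapses the daily rows into one row per ID per year. Dropping the WHERE clause would return totals for every year present in the table.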