Hive SQL - How to collapse records with contiguous date ranges?

Time: 2019-02-25 18:30:58

Tags: hive

How can I write logic that collapses multiple records with contiguous date ranges into a single row?

Sample data:

Member_key  start_date end_date
1            1/1/2017   1/31/2017
1            2/1/2017   2/28/2017
1            3/1/2017   3/31/2017
2            1/1/2017   1/31/2017
2            3/1/2017   3/31/2017

This should ultimately return the following result set:

1            1/1/2017   3/31/2017
2            1/1/2017   1/31/2017
2            3/1/2017   3/31/2017

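I found the following link very helpful, and I am fairly sure I am heading in the right direction, but I get errors when I try to convert the code to Hive SQL: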

http://betteratoracle.com/posts/35-collapsing-continuous-ranges-into-single-rows

This is where I am stuck (the second-to-last line below: the order by inside my max(grp) over (...)):

with data as(
select 
member_key, 
case 
         when datediff(start_date, lag(end_date) over (partition by member_key order by start_date asc)) <= 1 then
           null
         else
           row_number() over ()
         end grp,

start_date, 
end_date
from default.eligibility_span_test
order by member_key, start_date
)
select member_key, start_date, end_date
, max(grp) over (order by member_key, start_date) sequence
from data

These are the insert statements I used to add data to the test table:

insert into default.eligibility_span_test values (1, '2017-01-01','2017-01-31');
insert into default.eligibility_span_test values (1, '2017-02-01', '2017-02-28');
insert into default.eligibility_span_test values (1, '2017-03-01', '2017-03-31');
insert into default.eligibility_span_test values (2, '2017-01-01', '2017-01-31');
insert into default.eligibility_span_test values (2, '2017-03-01', '2017-03-31');

1 Answer:

Answer 0: (score: 0)

Can you try the query below?

with eligibility_span_test as
(
select 1 as Member_key, from_unixtime(unix_timestamp('2017-01-01', 'yyyy-MM-dd'), 'yyyy-MM-dd') as start_date, from_unixtime(unix_timestamp('2017-01-31', 'yyyy-MM-dd'), 'yyyy-MM-dd') end_date
union
select 1 as Member_key, from_unixtime(unix_timestamp('2017-02-01', 'yyyy-MM-dd'), 'yyyy-MM-dd') as start_date, from_unixtime(unix_timestamp('2017-02-28', 'yyyy-MM-dd'), 'yyyy-MM-dd') end_date
union
select 1 as Member_key, from_unixtime(unix_timestamp('2017-03-01', 'yyyy-MM-dd'), 'yyyy-MM-dd') as start_date, from_unixtime(unix_timestamp('2017-03-31', 'yyyy-MM-dd'), 'yyyy-MM-dd') end_date
union
select 2 as Member_key, from_unixtime(unix_timestamp('2017-01-01', 'yyyy-MM-dd'), 'yyyy-MM-dd') as start_date, from_unixtime(unix_timestamp('2017-01-31', 'yyyy-MM-dd'), 'yyyy-MM-dd') end_date
union
select 2 as Member_key, from_unixtime(unix_timestamp('2017-03-01', 'yyyy-MM-dd'), 'yyyy-MM-dd') as start_date, from_unixtime(unix_timestamp('2017-03-31', 'yyyy-MM-dd'), 'yyyy-MM-dd') end_date
),
res as (select member_key, month(start_date) - row_number() over (partition by member_key order by start_date) as groupBy, start_date, end_date from eligibility_span_test)
select member_key, min(start_date), max(end_date) from res group by groupBy, member_key;

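The query above returns a single row per member when that member's date ranges are contiguous, and separate rows for the ranges that are not contiguous. Note that it groups on month(start_date) minus the row number, so it assumes every span covers exactly one calendar month and that all spans fall within the same year.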
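
For spans that are not aligned to calendar months, a more general variant is the usual "gaps and islands" pattern: flag each row that starts a new span, then take a running sum of the flags to number the spans. The following is only a sketch, not the answerer's method. It assumes the table from the question (default.eligibility_span_test), that start_date and end_date are DATE columns or 'yyyy-MM-dd' strings, and that a member's spans are not nested inside one another. The names flagged, grouped, is_new_span, span_id, span_start and span_end are made up for illustration.

```sql
with flagged as (
  select
    member_key,
    start_date,
    end_date,
    -- 1 marks the first row of a new span: either there is no previous row
    -- (lag returns NULL, so the comparison is NULL and falls to ELSE)
    -- or the gap to the previous end_date is more than one day
    case
      when datediff(start_date,
                    lag(end_date) over (partition by member_key
                                        order by start_date)) <= 1
      then 0
      else 1
    end as is_new_span
  from default.eligibility_span_test
),
grouped as (
  select
    member_key,
    start_date,
    end_date,
    -- running total of the flags numbers the spans within each member
    sum(is_new_span) over (partition by member_key
                           order by start_date
                           rows between unbounded preceding and current row) as span_id
  from flagged
)
select
  member_key,
  min(start_date) as span_start,
  max(end_date)   as span_end
from grouped
group by member_key, span_id
order by member_key, span_start;
```

Keeping the window functions in separate CTEs avoids nesting one window function inside another, and partitioning by member_key keeps one member's spans from merging with the next member's. With the five rows inserted in the question, this should return one row for member 1 (2017-01-01 to 2017-03-31) and two rows for member 2.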