Here is a sample of the data I am working with:
ID  Issue  Sub Issue  Creation Time        Solved Time
1   A      A1         01-05-2015 00:10:10  01-05-2015 10:20:00
2   B      B1         01-05-2015 00:10:55  01-05-2015 10:30:30
3   A      A2         01-05-2015 00:11:30  02-05-2015 08:10:45
4   A      A1         01-05-2015 00:14:45  01-05-2015 10:25:00
5   D      D4         02-05-2015 13:10:00  NULL
6   B      B1         02-05-2015 00:14:35  NULL
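I want to treat IDs that have the same Issue and Sub Issue, and whose creation times are within 5 minutes of each other, as duplicates and eliminate them. When eliminating, if both rows have a Solved Time or neither has one, I can keep either of them. Otherwise, I keep the one that has a Solved Time value.
Example: 1 & 4 and 2 & 6 are the duplicate IDs in this sample. I would remove 1 and 6.
Can someone help me with a Hive / SQL query for this?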
Answer 0 (score: 0)
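IDs 2 and 6 are not duplicates: they were created on different dates, and their creation times are more than 23 hours apart (24 h 3 min 40 s).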
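The gap between IDs 2 and 6 can be checked directly with the same unix_timestamp arithmetic. This is only a small sketch; it assumes a Hive version that accepts a select without a from clause, and the column alias gap_minutes is made up for illustration.

```sql
-- Gap between the creation times of IDs 2 and 6, in minutes.
-- Assumption: the Hive version in use allows SELECT without FROM;
-- if not, append a one-row helper table to the query.
select (unix_timestamp('02-05-2015 00:14:35', 'dd-MM-yyyy HH:mm:ss')
      - unix_timestamp('01-05-2015 00:10:55', 'dd-MM-yyyy HH:mm:ss')) / 60 as gap_minutes;
```

The two timestamps are 86,620 seconds apart, roughly 1,443.7 minutes, which is far above the 5-minute threshold.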
I tested the following solution on your sample data, and it works:
-- Note: the format uses 'HH' (24-hour clock) to match the data; 'hh' is the 12-hour field.
select id, issue, sub_issue, creation_time, solved_time
from
( -- calculate is_duplicate_flag for all rows: a row is a duplicate when its
  -- next or previous neighbour in the same issue/sub_issue is at most 5 minutes away
  select case when ((unix_timestamp(next_creation_time,'dd-MM-yyyy HH:mm:ss')-unix_timestamp(creation_time,'dd-MM-yyyy HH:mm:ss'))/60 <= 5) or
                   ((unix_timestamp(creation_time,'dd-MM-yyyy HH:mm:ss')-unix_timestamp(prev_creation_time,'dd-MM-yyyy HH:mm:ss'))/60 <= 5)
              then true else false end as is_duplicate_flag,
         s.*
  from
  ( -- neighbouring creation times, plus a keep-priority rn per issue/sub_issue
    select
      t.*,
      lead(t.creation_time)
        over(partition by t.issue, t.sub_issue order by unix_timestamp(t.creation_time,'dd-MM-yyyy HH:mm:ss')) as next_creation_time,
      lag(t.creation_time)
        over(partition by t.issue, t.sub_issue order by unix_timestamp(t.creation_time,'dd-MM-yyyy HH:mm:ss')) as prev_creation_time,
      -- rn = 1 is the row to keep: solved rows first, then the latest creation time
      row_number() over(partition by t.issue, t.sub_issue order by case when t.solved_time is not null then 1 else 2 end, unix_timestamp(t.creation_time,'dd-MM-yyyy HH:mm:ss') desc) as rn
    from
    ( -- inline sample rows; replace this subquery with your own table
      select 1 as id, 'A' as issue, 'A1' as sub_issue, '01-05-2015 00:10:10' as creation_time, '01-05-2015 10:20:00' as solved_time from default.dual union all
      select 2 as id, 'B' as issue, 'B1' as sub_issue, '01-05-2015 00:10:55' as creation_time, '01-05-2015 10:30:30' as solved_time from default.dual union all
      select 3 as id, 'A' as issue, 'A2' as sub_issue, '01-05-2015 00:11:30' as creation_time, '02-05-2015 08:10:45' as solved_time from default.dual union all
      select 4 as id, 'A' as issue, 'A1' as sub_issue, '01-05-2015 00:14:45' as creation_time, '01-05-2015 10:25:00' as solved_time from default.dual union all
      select 5 as id, 'D' as issue, 'D4' as sub_issue, '02-05-2015 13:10:00' as creation_time, NULL as solved_time from default.dual union all
      select 6 as id, 'B' as issue, 'B1' as sub_issue, '02-05-2015 00:14:35' as creation_time, NULL as solved_time from default.dual
    ) t
  ) s
) f
-- non-duplicates are always kept; among duplicates only the rn = 1 row survives
where case when not is_duplicate_flag then 1 else rn end = 1
order by id
Result:
id  issue  sub_issue  creation_time        solved_time
2   B      B1         01-05-2015 00:10:55  01-05-2015 10:30:30
3   A      A2         01-05-2015 00:11:30  02-05-2015 08:10:45
4   A      A1         01-05-2015 00:14:45  01-05-2015 10:25:00
5   D      D4         02-05-2015 13:10:00  NULL
6   B      B1         02-05-2015 00:14:35  NULL
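One caveat about the query above: rn is numbered across the whole issue/sub_issue partition, so if the same issue/sub_issue has two separate groups of duplicates (say, one on each day), only one row in total would survive for both groups. A possible variation, sketched below, first splits each partition into "bursts" in which every row is at most 5 minutes after the previous one, then keeps one row per burst. This is an untested sketch, not part of the original answer: the table name tickets and the helper columns ts, prev_ts, new_burst and burst_id are hypothetical, and chaining rows by their previous neighbour is only one reading of the 5-minute rule.

```sql
-- Assumption: `tickets` is a hypothetical table with the columns from the question.
select id, issue, sub_issue, creation_time, solved_time
from (
  -- rank rows inside each burst: solved rows first, then the latest creation time
  select c.*,
         row_number() over (
           partition by issue, sub_issue, burst_id
           order by case when solved_time is not null then 1 else 2 end, ts desc
         ) as rn
  from (
    -- running sum of burst starts gives every burst its own number
    select b.*,
           sum(new_burst) over (
             partition by issue, sub_issue
             order by ts, id
             rows between unbounded preceding and current row
           ) as burst_id
    from (
      -- a row starts a new burst when it has no predecessor
      -- or follows it by more than 300 seconds (5 minutes)
      select a.*,
             case when prev_ts is null or ts - prev_ts > 300 then 1 else 0 end as new_burst
      from (
        select t.*,
               unix_timestamp(t.creation_time, 'dd-MM-yyyy HH:mm:ss') as ts,
               lag(unix_timestamp(t.creation_time, 'dd-MM-yyyy HH:mm:ss')) over (
                 partition by t.issue, t.sub_issue
                 order by unix_timestamp(t.creation_time, 'dd-MM-yyyy HH:mm:ss'), t.id
               ) as prev_ts
        from tickets t
      ) a
    ) b
  ) c
) d
where rn = 1
order by id;
```

On the six sample rows, only IDs 1 and 4 fall into the same burst (275 seconds apart), so this variation should keep the same IDs as the result above: 2, 3, 4, 5 and 6.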