Question

Here is a sample from the database I am working with:

ID  Issue   Sub Issue   Creation Time            Solved Time
1   A        A1        01-05-2015 00:10:10       01-05-2015 10:20:00
2   B        B1        01-05-2015 00:10:55       01-05-2015 10:30:30
3   A        A2        01-05-2015 00:11:30       02-05-2015 08:10:45
4   A        A1        01-05-2015 00:14:45       01-05-2015 10:25:00
5   D        D4        02-05-2015 13:10:00          NULL
6   B        B1        02-05-2015 00:14:35          NULL

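I want to treat IDs that have the same Issue and Sub Issue, and whose Creation Times differ by <= 5 minutes, as duplicates and eliminate them. When eliminating, if both rows have a Solved Time or neither has one, I can keep either of them. Otherwise, I keep the one that has a Solved Time value.

Ex: 1 & 4 and 2 & 6 are the duplicate IDs in this example. I delete 1 and 6.

Can someone help me with a Hive/SQL query for this?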

Answer 1

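2 and 6 are not duplicates: they were created on different dates, and the time difference is more than 23 hours.

I tested this solution on your sample data and it works: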
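The gap between IDs 2 and 6 can be checked with a few lines of Python. This is only a sanity check of the arithmetic, not part of the Hive query; the format string is an assumption that mirrors the day-month-year layout of the sample data.

```python
from datetime import datetime

# Day-month-year with a 24-hour clock, matching the sample data (assumed layout).
FMT = "%d-%m-%Y %H:%M:%S"

created_2 = datetime.strptime("01-05-2015 00:10:55", FMT)
created_6 = datetime.strptime("02-05-2015 00:14:35", FMT)

gap = created_6 - created_2
gap_minutes = gap.total_seconds() / 60

print(gap)               # 1 day, 0:03:40
print(gap_minutes <= 5)  # False, so 2 and 6 are outside the 5-minute window
```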

select id, issue, sub_issue, creation_time, solved_time
from (
    -- calculate is_duplicate_flag for all rows:
    -- a row is a duplicate if its next or previous row in the same
    -- (issue, sub_issue) partition was created within 5 minutes of it
    select case
               when ((unix_timestamp(next_creation_time, 'dd-MM-yyyy HH:mm:ss')
                      - unix_timestamp(creation_time, 'dd-MM-yyyy HH:mm:ss')) / 60 <= 5)
                 or ((unix_timestamp(creation_time, 'dd-MM-yyyy HH:mm:ss')
                      - unix_timestamp(prev_creation_time, 'dd-MM-yyyy HH:mm:ss')) / 60 <= 5)
               then true
               else false
           end as is_duplicate_flag,
           s.*
    from (
        -- HH is the 24-hour clock; hh (12-hour) would misread times such as 12:30:00
        select t.*,
               lead(t.creation_time) over (partition by t.issue, t.sub_issue
                   order by unix_timestamp(t.creation_time, 'dd-MM-yyyy HH:mm:ss')) as next_creation_time,
               lag(t.creation_time) over (partition by t.issue, t.sub_issue
                   order by unix_timestamp(t.creation_time, 'dd-MM-yyyy HH:mm:ss')) as prev_creation_time,
               -- rn = 1 is the row to keep: solved rows first, then the latest creation time
               row_number() over (partition by t.issue, t.sub_issue
                   order by case when t.solved_time is not null then 1 else 2 end,
                            unix_timestamp(t.creation_time, 'dd-MM-yyyy HH:mm:ss') desc) as rn
        from (
            -- sample data; replace this subquery with your own table
            select 1 as id, 'A' as issue, 'A1' as sub_issue, '01-05-2015 00:10:10' as creation_time, '01-05-2015 10:20:00' as solved_time from default.dual union all
            select 2 as id, 'B' as issue, 'B1' as sub_issue, '01-05-2015 00:10:55' as creation_time, '01-05-2015 10:30:30' as solved_time from default.dual union all
            select 3 as id, 'A' as issue, 'A2' as sub_issue, '01-05-2015 00:11:30' as creation_time, '02-05-2015 08:10:45' as solved_time from default.dual union all
            select 4 as id, 'A' as issue, 'A1' as sub_issue, '01-05-2015 00:14:45' as creation_time, '01-05-2015 10:25:00' as solved_time from default.dual union all
            select 5 as id, 'D' as issue, 'D4' as sub_issue, '02-05-2015 13:10:00' as creation_time, NULL as solved_time from default.dual union all
            select 6 as id, 'B' as issue, 'B1' as sub_issue, '02-05-2015 00:14:35' as creation_time, NULL as solved_time from default.dual
        ) t
    ) s
) s
-- keep every non-duplicate row; among duplicates keep only rn = 1
where case when not is_duplicate_flag then 1 else rn end = 1
order by id

Result:

id  issue  sub_issue  creation_time        solved_time
2   B      B1         01-05-2015 00:10:55  01-05-2015 10:30:30
3   A      A2         01-05-2015 00:11:30  02-05-2015 08:10:45
4   A      A1         01-05-2015 00:14:45  01-05-2015 10:25:00
5   D      D4         02-05-2015 13:10:00  NULL
6   B      B1         02-05-2015 00:14:35  NULL
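As a cross-check of the result, here is a minimal Python sketch of the same deduplication rule. It is not a translation of the query: the query ranks rows across the whole (issue, sub_issue) partition, while this sketch groups rows into clusters in which each row was created within 5 minutes of the previous one, and keeps one row per cluster. That clustering, and the helper names `created`, `pick` and `dedupe`, are assumptions made for illustration. On the sample data both approaches keep the same IDs.

```python
from datetime import datetime, timedelta
from itertools import groupby

FMT = "%d-%m-%Y %H:%M:%S"      # assumed layout of the sample timestamps
WINDOW = timedelta(minutes=5)  # duplicate window from the question

# (id, issue, sub_issue, creation_time, solved_time), copied from the question
rows = [
    (1, "A", "A1", "01-05-2015 00:10:10", "01-05-2015 10:20:00"),
    (2, "B", "B1", "01-05-2015 00:10:55", "01-05-2015 10:30:30"),
    (3, "A", "A2", "01-05-2015 00:11:30", "02-05-2015 08:10:45"),
    (4, "A", "A1", "01-05-2015 00:14:45", "01-05-2015 10:25:00"),
    (5, "D", "D4", "02-05-2015 13:10:00", None),
    (6, "B", "B1", "02-05-2015 00:14:35", None),
]

def created(row):
    """Parse the creation time of a row."""
    return datetime.strptime(row[3], FMT)

def pick(cluster):
    """Keep one row per cluster: solved rows win, then the latest creation time."""
    return max(cluster, key=lambda r: (r[4] is not None, created(r)))

def dedupe(rows):
    """Return the rows that survive deduplication, ordered by id."""
    part_key = lambda r: (r[1], r[2])  # same Issue and Sub Issue
    ordered = sorted(rows, key=lambda r: (part_key(r), created(r)))
    kept = []
    for _, group in groupby(ordered, key=part_key):
        cluster = []
        for row in group:
            # Start a new cluster when the gap to the previous row exceeds the window.
            if cluster and created(row) - created(cluster[-1]) > WINDOW:
                kept.append(pick(cluster))
                cluster = []
            cluster.append(row)
        kept.append(pick(cluster))
    return sorted(kept, key=lambda r: r[0])

print([r[0] for r in dedupe(rows)])  # [2, 3, 4, 5, 6]
```

IDs 1 and 4 are 4 minutes 35 seconds apart, so they form one cluster and only ID 4 (the later of two solved rows) is kept. Keeping ID 1 instead would also satisfy the rule in the question, since both rows are solved.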
