Hive: Deduplicating data based on multiple conditions

Time: 2017-06-11 23:12:23

Tags: hive hiveql

This is my sample data table:

row#    date        customerid                      event       itemid-A     Itemid-B
1       5/1/17  4c9b3705121ac1493640912601          page load   473685  
2       5/1/17  11dacfc4251da01493672636536         page load   863438  
3       5/1/17  11dacfc4251da01493672636536         click       863438       45485

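Condition #1: I need to remove row 2 from the data because it has the same customerid as row 3. Basically, drop the page load event and keep the click event whenever a customerid is duplicated. The click event will have a unique Itemid-B.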

Condition #2: When a customerid is not duplicated, I need to keep the page load event, as in row 1.
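
To try both conditions against the three sample rows, the table can be sketched in Hive roughly as follows. The table name mytable and the column names dt, itemid_A and Itemid_B are borrowed from the answer below; the column types are assumptions, since the question does not state them.

```sql
-- Hypothetical table definition: names follow the answer's query,
-- types are assumed (not given in the question).
create table mytable
(
    dt          date
   ,customerid  string
   ,event       string
   ,itemid_A    int
   ,Itemid_B    int
);

-- The three sample rows; a missing Itemid-B is loaded as NULL.
insert into table mytable
select cast('2017-05-01' as date), '4c9b3705121ac1493640912601',  'page load', 473685, cast(null as int)
union all
select cast('2017-05-01' as date), '11dacfc4251da01493672636536', 'page load', 863438, cast(null as int)
union all
select cast('2017-05-01' as date), '11dacfc4251da01493672636536', 'click',     863438, 45485
;
```

The insert uses select ... union all rather than a values list only to sidestep differences between Hive versions in how NULL literals are handled; either form should work on a recent release.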

1 answer:

Answer 0: (score: 1)

select  dt,customerid,event,itemid_A,Itemid_B

from   (select  * 
               ,row_number() over
                (
                    partition by    customerid
                    order by        field(event,'click','page load')  -- 'click' (1) sorts before 'page load' (2)
                ) as rn

        from    mytable
        ) t

where   rn = 1
;

+------------+-----------------------------+-----------+----------+----------+
|     dt     |         customerid          |   event   | itemid_a | itemid_b |
+------------+-----------------------------+-----------+----------+----------+
| 2017-05-01 | 11dacfc4251da01493672636536 | click     | 863,438  | 45,485   |
| 2017-05-01 | 4c9b3705121ac1493640912601  | page load | 473,685  | (null)   |
+------------+-----------------------------+-----------+----------+----------+
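
The ordering expression is what satisfies both conditions. In Hive, field(event,'click','page load') returns 1 for click and 2 for page load, so within each customerid the click row sorts first and receives rn = 1. A customer with only a page load row keeps that row. The same idea can be sketched with an explicit case expression, which may be easier to read or to port to engines without field. This is an alternative formulation, not part of the original answer:

```sql
select  dt,customerid,event,itemid_A,Itemid_B
from   (select  *
               ,row_number() over
                (
                    partition by    customerid
                    -- lower rank wins: click first, then page load, anything else last
                    order by        case event
                                        when 'click'     then 1
                                        when 'page load' then 2
                                        else 3
                                    end
                ) as rn
        from    mytable
        ) t
where   rn = 1
;
```

One behavioral difference: field returns 0 for a value that is not in its list, so an unlisted event would sort ahead of click in the original query, whereas the case version above ranks it last. With only the two event types shown in the sample data, both queries should return the same two rows.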