Hive - Select unique rows based on some columns

Time: 2017-07-02 01:37:28

Tags: hadoop group-by hive duplicates

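I am trying to group rows that have the same values in two columns, and rank/order the results based on a third column.

The result should contain all the other columns.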

Sample table, to be grouped or filtered by columns c1 and c2 and ordered by the time in c3:

with sample as (
    select 'A' as c1, 'B' as c2, '22:00' as c3, 'Da' as c4
    union all
    select 'A' as c1, 'B' as c2, '23:00' as c3, 'Db' as c4
    union all
    select 'A' as c1, 'B' as c2, '09:00' as c3, 'Dc' as c4
    union all
    select 'A' as c1, 'C' as c2, '22:00' as c3, 'Dd' as c4
    union all
    select 'B' as c1, 'C' as c2, '09:00' as c3, 'De' as c4
)

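All other columns (c4, c5, and so on) should be kept, but they must have no effect on the grouping criteria or the rank.

I believe a window function partitioned by c1, c2 and ordered by c3 would work, but I am not sure it is the best approach for very large tables, or when I need to group by more columns.

The final output should be one UNIQUE row per group, the one whose rank is 1 (top). The columns should be exactly the same as in the sample table (no rank column).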

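Ranking the rows with row_number() over (partition by c1, c2 order by c3) as rnk gives:

+----+----+-------+----+-----+
| c1 | c2 | c3    | c4 | rnk |
+----+----+-------+----+-----+
| A  | B  | 09:00 | Dc | 1   |
| A  | B  | 22:00 | Da | 2   |
| A  | B  | 23:00 | Db | 3   |
| A  | C  | 22:00 | Dd | 1   |
| B  | C  | 09:00 | De | 1   |
+----+----+-------+----+-----+

Select * from tableX where rnk = 1

would do the job, but it keeps the column 'rnk'. I want to avoid writing out every column in the select just to exclude rnk.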

*Edited: added the final table

2 Answers:

Answer 0: (score: 3)

select  inline(array(rec))
from   (select  struct(*)   as rec
               ,row_number() over
                (
                    partition by    c1,c2
                    order by        c3
                ) as rn
        from    sample t
        ) t
where   rn = 1
;

+------+------+-------+------+
| col1 | col2 | col3  | col4 |
+------+------+-------+------+
| A    | B    | 09:00 | Dc   |
| A    | C    | 22:00 | Dd   |
| B    | C    | 09:00 | De   |
+------+------+-------+------+
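
As a rough illustration of why the query works: struct(*) packs all columns of a row into a single struct value, and inline() expands an array of structs back into rows, with one output column per struct field. A minimal standalone sketch, where the literal values are simply copied from the first result row for illustration:

```sql
-- struct(...) builds one struct value; without explicit field names
-- its fields get the default names col1, col2, ...
-- inline(array(...)) turns the array of structs back into one row per struct.
select inline(array(struct('A', 'B', '09:00', 'Dc')));
```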

P.S. Please note that the column names are aliased (col1, col2, ...) because of the use of struct.
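
If the aliased names are a problem, one possible alternative is Hive's regex column specification, which selects every column except the helper one and keeps the original names. This is only a sketch and not part of the answer above. It assumes a Hive version where setting hive.support.quoted.identifiers to none makes backticked names in the select list behave as regular expressions, and it reuses the sample CTE from the question.

```sql
-- Assumption: regex column specification is enabled by disabling
-- quoted identifiers (required on Hive 0.13.0 and later).
set hive.support.quoted.identifiers=none;

with sample as (
    select 'A' as c1, 'B' as c2, '22:00' as c3, 'Da' as c4
    union all
    select 'A' as c1, 'B' as c2, '23:00' as c3, 'Db' as c4
    union all
    select 'A' as c1, 'B' as c2, '09:00' as c3, 'Dc' as c4
    union all
    select 'A' as c1, 'C' as c2, '22:00' as c3, 'Dd' as c4
    union all
    select 'B' as c1, 'C' as c2, '09:00' as c3, 'De' as c4
)
-- `(rn)?+.+` matches every column name except the one that is exactly "rn"
select  `(rn)?+.+`
from   (select  t.*
               ,row_number() over (partition by c1, c2 order by c3) as rn
        from    sample t
        ) t
where   rn = 1
;
```

The setting changes how backticks are interpreted for the rest of the session, so it may be worth setting it back to its default (column) afterwards.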

Answer 1: (score: 0)

I think you just want row_number():

select t.*,
       row_number() over (partition by c1, c2 order by c3) as rnk
from sample t;

The question seems to have changed since I answered it, which is a rather rude thing to do. If you want the top-ranked rows, use a subquery:

select t.*
from (select t.*,
             row_number() over (partition by c1, c2 order by c3) as rnk
      from sample t
     ) t
where rnk = 1;

This returns one row for each c1/c2 combination in the data. If you want all rows in the case of ties, use rank() instead of row_number().
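
To illustrate the difference, here is a small sketch with a tie on c3. The sample rows, including the 'Dx' row, are made up for illustration and are not the question's data.

```sql
with sample as (
    select 'A' as c1, 'B' as c2, '09:00' as c3, 'Dc' as c4
    union all
    select 'A' as c1, 'B' as c2, '09:00' as c3, 'Dx' as c4   -- hypothetical row that ties with the one above
    union all
    select 'A' as c1, 'B' as c2, '22:00' as c3, 'Da' as c4
)
select t.*
from (select t.*,
             rank() over (partition by c1, c2 order by c3) as rnk
      from sample t
     ) t
where rnk = 1;
```

With rank(), both 09:00 rows get rnk = 1 and are returned. With row_number(), only one of them would be returned, and which one is not deterministic.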