Deduplicating rows in a table on certain columns while keeping the corresponding values of the other columns in HIVE

Asked: 2015-12-07 16:50:38

Tags: hive

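I need to create a temporary table in HIVE from an existing table that has 7 columns. I only want to remove duplicates with respect to the first three columns and keep the corresponding values in the other 4 columns. I don't care which row actually gets dropped, as long as deduplication is done on the first three columns only.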

1 Answer:

Answer 0: (score: 2)

If you don't care which of the duplicate rows is kept, you can use the following:

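-- Pack col4..col7 into one string per row, keep the lexicographically
-- largest string per (col1, col2, col3) group, then unpack it again.
-- The unpacked columns are strings; cast them back if needed.
create table table2 as 
select col1, col2, col3
      -- split() takes a regular expression, so the pipe must be escaped
      ,split(agg_col,'\\|')[0] as col4
      ,split(agg_col,'\\|')[1] as col5
      ,split(agg_col,'\\|')[2] as col6
      ,split(agg_col,'\\|')[3] as col7
from (select col1, col2, col3,
             max(concat(cast(col4 as string),'|', 
                        cast(col5 as string),'|',
                        cast(col6 as string),'|',
                        cast(col7 as string))) as agg_col
      from table1
      group by col1, col2, col3 ) A;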
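Two details of this approach are easy to check in isolation before running it on real data. The statements below are only a sketch with literal values (no table from the question is involved): the first shows that the second argument of split() is a regular expression, so the pipe has to be escaped, and the second shows that concat() returns NULL as soon as any argument is NULL, which means a row with a NULL in col4..col7 contributes nothing to max().

```sql
-- Escaped pipe: splits on the literal '|' character
select split('a|b|c', '\\|');

-- concat() yields NULL if any argument is NULL
select concat('a', '|', cast(null as string));
```

If NULLs can occur in the last four columns, wrapping each cast in coalesce(..., '') is one possible workaround, at the cost of turning NULLs into empty strings in the result. The approach also assumes that none of the values contains the '|' character.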

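Below is another approach that gives much more control over which row is kept (through the ordering), but it is slower than the one above: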

create table table2 as 
select col1, col2, col3,
       -- aliases keep the original column names in the new table
       max(col4) as col4, max(col5) as col5, max(col6) as col6, max(col7) as col7
from (select col1, col2, col3, col4, col5, col6, col7,
             rank() over ( partition by col1, col2, col3 
                           order by col4 desc, col5 desc, col6 desc, col7 desc ) as col_rank
      from table1 ) A
where A.col_rank = 1
group by col1, col2, col3;

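The rank() over (..) function returns several rows with rank 1 when all of the order-by columns are equal. In our case, if two rows have exactly the same values in all seven columns, both survive the col_rank = 1 filter and the duplicates remain. They can be collapsed with the max and group by clauses, as written in the query above.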
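A possible variant, not part of the original answer: row_number() assigns a distinct number to every row within a partition, even when the order-by columns tie, so filtering on 1 keeps exactly one row per group and the outer max/group by is no longer needed. The sketch below assumes the same table1 and column names as above; the target name table2_rn and the alias rn are made up for illustration.

```sql
-- Sketch only: table2_rn and rn are hypothetical names
create table table2_rn as
select col1, col2, col3, col4, col5, col6, col7
from (select col1, col2, col3, col4, col5, col6, col7,
             -- exactly one row per (col1, col2, col3) gets rn = 1, ties included
             row_number() over ( partition by col1, col2, col3
                                 order by col4 desc, col5 desc, col6 desc, col7 desc ) as rn
      from table1 ) A
where A.rn = 1;
```

Because the four remaining values come from a single source row, they keep their original types and always belong together. When rows tie on all order-by columns, which one receives rn = 1 is arbitrary, which matches the requirement that it does not matter which duplicate is dropped.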