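I need to create a temporary table in Hive from an existing table that has 7 columns. I only want to remove rows that are duplicates with respect to the first three columns, keeping the corresponding values of the other 4 columns from whichever row survives. I don't care which row is actually dropped; the de-duplication should be based on the first three columns alone.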
Answer 0: (score: 2)
If the ordering does not matter, you can use the following:
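-- Pack col4..col7 into one delimited string per row and keep the max string per key.
-- Caveats: concat() returns NULL if any argument is NULL, values must not contain "|",
-- and the recovered col4..col7 are strings (cast them back if the types matter).
create table table2 as
select col1, col2, col3
,split(agg_col,"\\|")[0] as col4 -- split() takes a regex, so the pipe must be escaped
,split(agg_col,"\\|")[1] as col5
,split(agg_col,"\\|")[2] as col6
,split(agg_col,"\\|")[3] as col7
from (Select col1, col2, col3,
max(concat(cast(col4 as string),"|",
cast(col5 as string),"|",
cast(col6 as string),"|",
cast(col7 as string))) as agg_col
from table1
group by col1,col2,col3 ) A;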
Below is another approach. It gives much more control over the ordering, but it is slower than the first query:
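-- Rank the rows within each (col1, col2, col3) key and keep only the top-ranked ones.
-- The outer max/group by collapses ties, which are identical in all seven columns.
create table table2 as
select col1, col2, col3,
max(col4) as col4, max(col5) as col5, max(col6) as col6, max(col7) as col7
from (Select col1, col2, col3,col4, col5, col6, col7,
rank() over ( partition by col1, col2, col3
order by col4 desc, col5 desc, col6 desc, col7 desc ) as col_rank
from table1 ) A
where A.col_rank = 1
GROUP BY col1, col2, col3;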
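The rank() over (..) function assigns rank 1 to more than one row when those rows tie on every ORDER BY column. In our case, if two rows have exactly the same values in all seven columns, both pass the col_rank = 1 filter and the key is still duplicated. The max and group by clauses in the query above collapse those tied rows into a single row.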
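A possible variation, not part of the original answer: row_number() gives every row in a partition a distinct number even when rows tie, so filtering on 1 should leave exactly one row per key without the extra max and group by. The sketch below assumes the same table1 and col1..col7 as in the question; table2_rn and rn are hypothetical names chosen for illustration.

```sql
-- Sketch only: table2_rn and rn are illustrative names, not from the original post.
create table table2_rn as
select col1, col2, col3, col4, col5, col6, col7
from (
  select col1, col2, col3, col4, col5, col6, col7,
         -- row_number() never repeats a value within a partition, so ties are broken arbitrarily
         row_number() over (partition by col1, col2, col3
                            order by col4 desc, col5 desc, col6 desc, col7 desc) as rn
  from table1
) A
where A.rn = 1;

-- Sanity check: this should return no rows if every key is now unique.
select col1, col2, col3, count(*) as n
from table2_rn
group by col1, col2, col3
having count(*) > 1;
```

Unlike the concat/split query, this keeps col4..col7 in their original types and does not depend on a delimiter that is absent from the data. Which tied row is kept is not deterministic, which matches the question's statement that it does not matter which row is dropped.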