Question

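I have a Hive table with many columns, about 80. I need to apply a distinct clause to some of the columns and take the first value from the others. This is what I am trying to achieve: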

select distinct(col1,col2,col3),col5,col6,col7
from abc where col1 = 'something';

All of the columns mentioned above are text columns, so I cannot apply group by and aggregate functions.

Answer 1

You can solve this with the row_number function.

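-- Number the rows within each (col1, col2, col3) group.
-- Without an ORDER BY in the window, which row gets rn = 1 is arbitrary.
create table temp as
select *, row_number() over (partition by col1,col2,col3) as rn
from abc
where col1 = 'something';

-- Keep exactly one row per group.
select *
from temp
where rn=1;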

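You can also order the rows within each partition, so that the row kept as rn = 1 is the one with the smallest col4 rather than an arbitrary one.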

row_number() over (partition by col1,col2,col3 order by col4 asc) as rn
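If a temporary table is not wanted, the same idea can be written as a single query with a subquery. This is only a sketch: it assumes col4 is a sensible column for deciding which row counts as "first", and the alias t is made up for illustration.

```sql
-- Sketch: one row per (col1, col2, col3), chosen by the smallest col4.
-- The ordering column col4 and the alias t are assumptions.
SELECT col1, col2, col3, col5, col6, col7
FROM (
  SELECT col1, col2, col3, col5, col6, col7,
         row_number() OVER (PARTITION BY col1, col2, col3
                            ORDER BY col4 ASC) AS rn
  FROM abc
  WHERE col1 = 'something'
) t
WHERE rn = 1;
```

Listing the columns explicitly in the outer select also keeps the helper column rn out of the result. If several rows in a group share the same smallest col4, the one that gets rn = 1 is still arbitrary.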

Answer 2

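DISTINCT is the most overused and least capable function in SQL. It is the last thing applied, over the entire result set, and it removes duplicates using all of the columns in the select list. You can GROUP BY string columns; in fact, that is the answer here: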

SELECT col1,col2,col3,COLLECT_SET(col4),COLLECT_SET(col5),COLLECT_SET(col6)
FROM abc WHERE col1 = 'something'
GROUP BY col1,col2,col3;
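Note that COLLECT_SET returns an array of the distinct values in each group, not a single value. If one value per group is enough, a possible variation is to index into the array or to use MIN/MAX, which also work on strings. The following is a sketch using the columns from the question; the output aliases are only illustrative.

```sql
-- Sketch: one scalar per group instead of an array.
-- COLLECT_SET(...)[0] picks an arbitrary element of the set;
-- MIN/MAX compare strings lexicographically.
SELECT col1, col2, col3,
       COLLECT_SET(col5)[0] AS col5,
       MIN(col6)            AS col6,
       MAX(col7)            AS col7
FROM abc
WHERE col1 = 'something'
GROUP BY col1, col2, col3;
```

Each aggregate is evaluated independently, so the values returned for col5, col6 and col7 may come from different source rows. If they must all come from the same row, the row_number approach is the safer choice.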

Now that I have re-read your question, I am not sure what you are after. You may have to join the table to an aggregate of itself.
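One way such a self-join could look is sketched below. It assumes that "first" means the row with the smallest col4 in each group, which the question does not actually state; the aliases a, g and first_col4 are hypothetical.

```sql
-- Sketch: join abc to an aggregate of itself.
-- Assumes col4 identifies the "first" row of each group.
SELECT a.col1, a.col2, a.col3, a.col5, a.col6, a.col7
FROM abc a
JOIN (
  SELECT col1, col2, col3, MIN(col4) AS first_col4
  FROM abc
  WHERE col1 = 'something'
  GROUP BY col1, col2, col3
) g
  ON  a.col1 = g.col1
  AND a.col2 = g.col2
  AND a.col3 = g.col3
  AND a.col4 = g.first_col4;
```

This returns more than one row per group when several rows tie on the smallest col4, and rows with NULL in any of the join columns do not match at all.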

Select distinct on specific columns but also select other columns in Hive
