SQL query to calculate frequency in Hive

Time: 2018-01-11 05:42:37

Tags: sql hadoop hive

I have a table tab in Hive that looks like this:

word | occurrences  
---- | -----------  
by   | 10
hi   | 1
same | 3
love | 6

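I want to use a Hive query to calculate and display each word's frequency (its occurrences divided by the sum of the whole column). For example, the frequency of the word 'by' is 10 / (10 + 1 + 3 + 6) = 0.5.

I tried: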

SELECT word, occurrences, occurrences/SUM(occurrences) AS frequency
FROM tab
GROUP BY word, occurrences
ORDER BY frequency;

But it gives this:

word | occurrences | frequency
---- | ----------- | ---------
by   | 10          | 1
hi   | 1           | 1
same | 3           | 1
love | 6           | 1

I'm not sure what I'm doing wrong. My SQL isn't very good. Thanks in advance.

3 answers:

Answer 0: (score: 0)

Try the SQL below, which uses SUM() OVER():

SELECT word, occurrences, occurrences/SUM(occurrences) OVER() AS frequency
FROM tab
ORDER BY frequency;
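
As a rough sketch of what the empty OVER() does, the window sum can also be selected on its own so the denominator is visible on every row. The total column is added here only for illustration and is not part of the query above:

```sql
-- Illustrative only: expose the window total next to each row.
-- An empty OVER () makes the window the whole result set, so
-- SUM(occurrences) OVER () is the same grand total on every row.
SELECT word,
       occurrences,
       SUM(occurrences) OVER () AS total,
       occurrences / SUM(occurrences) OVER () AS frequency
FROM tab
ORDER BY frequency;
```

With the sample table, total should be 20 on all four rows, so 'by' comes out as 10 / 20 = 0.5.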

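Answer 1: (score: 0)

You don't need to GROUP BY any column, because the denominator should be the total of all occurrences. With one row per word, grouping by word and occurrences leaves a single row in each group, so SUM(occurrences) equals occurrences and the ratio is always 1.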

SELECT a.word, a.occurrences, a.occurrences/b.total_freq AS frequency
FROM 
tab a CROSS JOIN (SELECT SUM(occurrences) AS total_freq FROM tab) b
ORDER BY frequency;

The cross join makes total_freq available on every row of tab, and the outer query then uses it as the denominator.
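
To see the intermediate result this describes, a minimal sketch that selects only the joined columns (assuming the same tab table) could look like this:

```sql
-- Illustrative only: the one-row subquery b is paired with every row of tab,
-- so total_freq repeats on each output row.
SELECT a.word, a.occurrences, b.total_freq
FROM tab a
CROSS JOIN (SELECT SUM(occurrences) AS total_freq FROM tab) b;
```

For the sample data, total_freq would be 20 on every row.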

Answer 2: (score: 0)

with a1 as (
SELECT word, occurrences, occurrences/SUM(occurrences) OVER() AS frequency
FROM tab
ORDER BY frequency
)
select * from a1