Getting percentages from counts in Hive

Time: 2014-09-06 06:02:44

Tags: sql hadoop hive

I have a table like this:

COL1    COL2    DATETIMESTAMP   CATEGORY1   CATEGORY2
e-12    1101    201408110525    Arts and Entertainment  Television
e-12    1101    201408110525    Arts and Entertainment  Television
e-12    1101    201408110525    Arts and Entertainment  Television
e-12    1101    201408110620    Technology and Computing    Internet Technology
e-12    1101    201408110705    Technology and Computing    Antivirus Software
e-12    1107    201408110510    Business    Advertising
e-12    1107    201408110520    Business    Marketing
e-12    1107    201408110520    Business    Marketing
e-12    1107    201408110520    Business    Marketing
e-12    1107    201408110520    Business    Marketing
e-12    1107    201408110520    Business    Marketing
e-12    1107    201408110520    Business    Marketing
e-12    1107    201408110520    Business    Marketing
e-12    1109    201408110505    Technology and Computing    Web Search

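COL1 can be ignored (its value is the same in every row). For each COL2 value there are several combinations of the remaining fields. I managed to get the count of each repeated combination, with this result: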

COL1    COL2    DATETIMESTAMP   CATEGORY1   CATEGORY2   COUNT
e-12    1101    201408110525    Arts and Entertainment  Television  3
e-12    1101    201408110620    Technology and Computing    Internet Technology 1
e-12    1101    201408110705    Technology and Computing    Antivirus Software  1
e-12    1107    201408110510    Business    Advertising 1
e-12    1107    201408110520    Business    Marketing   7
e-12    1109    201408110505    Technology and Computing    Web Search  1

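How do I convert each count into a percentage of all the combinations within its COL2 value?

Sorry, I can't put it into words very well, but the output should look like this: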

COL1    COL2    DATETIMESTAMP   CATEGORY1   CATEGORY2   COUNT   PERCENTAGE
e-12    1101    201408110525    Arts and Entertainment  Television  3   60%
e-12    1101    201408110620    Technology and Computing    Internet Technology 1   20%
e-12    1101    201408110705    Technology and Computing    Antivirus Software  1   20%
e-12    1107    201408110510    Business    Advertising 1   12.5%
e-12    1107    201408110520    Business    Marketing   7   87.5%
e-12    1109    201408110505    Technology and Computing    Web Search  1   100%

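Note: at this point, the count itself is not needed.

Is this even possible in Hive? How can I modify my count query (below) to output the last table?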

SELECT COL1, COL2, DATETIMESTAMP, CATEGORY1, CATEGORY2, count(*) FROM temp_table GROUP BY CATEGORY1, CATEGORY2, DATETIMESTAMP, COL2, COL1 SORT BY COL2;

Thanks.

1 answer:

Answer 0 (score: 1)

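I can think of a couple of ways to do this. You can compute the denominator of the percentage (the total number of rows per col2), join it back to the original data, and then divide each group's count by that total. Alternatively, if you have access to windowing functions in Hive (I believe 0.13 includes them), you can use OVER with PARTITION BY in the SELECT statement to avoid the join described in the first approach.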

#1:

select col2, cat1, cat2, datetimestamp
    ,(COUNT(cat2) / MAX(total_)) as perc
from (
    select n.col2, cat1, cat2, datetimestamp, x.total_
    from some_table as n
    JOIN (
        select col2, COUNT(col2) as total_
        from some_table
        group by col2
         ) x
    ON x.col2 = n.col2
     ) y
group by cat1, cat2, col2, datetimestamp

#2:

select col2, cat1, cat2, datetimestamp
    ,(COUNT(col2) / MAX(total)) as perc
from (
    -- datetimestamp must be selected here because the outer query groups by it
    select col2, cat1, cat2, datetimestamp
        ,COUNT(cat1) OVER (PARTITION BY col2) as total
    from some_table
     ) x
group by cat1, cat2, col2, datetimestamp
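
For reference, here is a minimal sketch of the windowing approach written against the table and column names from the question (temp_table, COL1, COL2, DATETIMESTAMP, CATEGORY1, CATEGORY2). The aliases cnt, percentage, and counts are made up for illustration, and the query assumes a Hive version with windowing support. It counts each combination first and then divides by the per-COL2 total, multiplying by 100 to get a percentage rather than a fraction.

```sql
-- Sketch only: cnt, percentage, and counts are illustrative aliases.
SELECT COL1, COL2, DATETIMESTAMP, CATEGORY1, CATEGORY2,
       cnt,
       -- share of this combination among all rows with the same COL2
       round(100.0 * cnt / SUM(cnt) OVER (PARTITION BY COL2), 1) AS percentage
FROM (
    -- the count query from the question, without the SORT BY
    SELECT COL1, COL2, DATETIMESTAMP, CATEGORY1, CATEGORY2,
           count(*) AS cnt
    FROM temp_table
    GROUP BY COL1, COL2, DATETIMESTAMP, CATEGORY1, CATEGORY2
) counts
ORDER BY COL2;
```

With the sample data in the question, COL2 = 1101 has 5 rows in total, so its three combinations should come out as 60, 20, and 20 percent. If the count is not wanted in the output, cnt can simply be dropped from the outer SELECT list.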