I have a table like this:
COL1 COL2 DATETIMESTAMP CATEGORY1 CATEGORY2
e-12 1101 201408110525 Arts and Entertainment Television
e-12 1101 201408110525 Arts and Entertainment Television
e-12 1101 201408110525 Arts and Entertainment Television
e-12 1101 201408110620 Technology and Computing Internet Technology
e-12 1101 201408110705 Technology and Computing Antivirus Software
e-12 1107 201408110510 Business Advertising
e-12 1107 201408110520 Business Marketing
e-12 1107 201408110520 Business Marketing
e-12 1107 201408110520 Business Marketing
e-12 1107 201408110520 Business Marketing
e-12 1107 201408110520 Business Marketing
e-12 1107 201408110520 Business Marketing
e-12 1107 201408110520 Business Marketing
e-12 1109 201408110505 Technology and Computing Web Search
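Ignoring COL1 (its values are all the same), each COL2 has several combinations of the remaining fields. I managed to get the count of each repeated combination, which gives: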
COL1 COL2 DATETIMESTAMP CATEGORY1 CATEGORY2 COUNT
e-12 1101 201408110525 Arts and Entertainment Television 3
e-12 1101 201408110620 Technology and Computing Internet Technology 1
e-12 1101 201408110705 Technology and Computing Antivirus Software 1
e-12 1107 201408110510 Business Advertising 1
e-12 1107 201408110520 Business Marketing 7
e-12 1109 201408110505 Technology and Computing Web Search 1
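How can I convert each count into a percentage of all the rows for its COL2?
Sorry, I can't put it into words very well, but the output should look like this: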
COL1 COL2 DATETIMESTAMP CATEGORY1 CATEGORY2 COUNT PERCENTAGE
e-12 1101 201408110525 Arts and Entertainment Television 3 60%
e-12 1101 201408110620 Technology and Computing Internet Technology 1 20%
e-12 1101 201408110705 Technology and Computing Antivirus Software 1 20%
e-12 1107 201408110510 Business Advertising 1 12.5%
e-12 1107 201408110520 Business Marketing 7 87.5%
e-12 1109 201408110505 Technology and Computing Web Search 1 100%
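Note: at this point, the count itself is not needed.
Is this even possible in Hive? How can I modify my count query (below) to produce the last table?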
SELECT COL1, COL2, DATETIMESTAMP, CATEGORY1, CATEGORY2, count(*) FROM temp_table GROUP BY CATEGORY1, CATEGORY2, DATETIMESTAMP, COL2, COL1 SORT BY COL2;
Thanks.
Answer 0 (score: 1)
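I can think of a couple of ways to do this. You can compute the denominator of the percentage (the total number of rows per COL2), join it back to the original data, and then divide each combination's count by that total. Alternatively, if you have access to windowing functions in Hive (I believe they shipped with 0.13; the Hive language manual dates them to 0.11), you can use OVER with PARTITION BY in the SELECT statement to avoid the join described in the first approach.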
#1:
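-- cat1, cat2 and some_table stand for the question's CATEGORY1, CATEGORY2 and temp_table
select col2, cat1, cat2, datetimestamp
      ,(COUNT(cat2) / MAX(total_)) as perc  -- a fraction; multiply by 100 for a percentage
from (
  select n.col2, cat1, cat2, datetimestamp, x.total_
  from some_table as n
  JOIN (
    -- denominator: total number of rows for each col2
    select col2, COUNT(col2) as total_
    from some_table
    group by col2
  ) x
  ON x.col2 = n.col2
) y
group by cat1, cat2, col2, datetimestamp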
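To check the join approach against the sample data in the question, here is a minimal sketch that runs an equivalent query in an in-memory SQLite database from Python. SQLite is only a stand-in for Hive here, and the column types, the COUNT(*) variant, and the multiplication by 100.0 are my assumptions rather than part of the query above.

```python
import sqlite3

# Sample rows from the question: (col2, datetimestamp, category1, category2).
# COL1 is omitted because it is constant.
rows = (
    [("1101", "201408110525", "Arts and Entertainment", "Television")] * 3
    + [
        ("1101", "201408110620", "Technology and Computing", "Internet Technology"),
        ("1101", "201408110705", "Technology and Computing", "Antivirus Software"),
        ("1107", "201408110510", "Business", "Advertising"),
    ]
    + [("1107", "201408110520", "Business", "Marketing")] * 7
    + [("1109", "201408110505", "Technology and Computing", "Web Search")]
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE temp_table "
    "(col2 TEXT, datetimestamp TEXT, category1 TEXT, category2 TEXT)"
)
conn.executemany("INSERT INTO temp_table VALUES (?, ?, ?, ?)", rows)

# Join each row to the per-col2 total, then divide the group count by it.
query = """
SELECT n.col2, n.datetimestamp, n.category1, n.category2,
       COUNT(*) AS cnt,
       100.0 * COUNT(*) / MAX(x.total_) AS percentage
FROM temp_table AS n
JOIN (SELECT col2, COUNT(*) AS total_ FROM temp_table GROUP BY col2) AS x
  ON x.col2 = n.col2
GROUP BY n.col2, n.datetimestamp, n.category1, n.category2
ORDER BY n.col2, n.datetimestamp
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The percentages come out as 60, 20, 20 for 1101, then 12.5 and 87.5 for 1107, and 100 for 1109, which matches the table the question asks for.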
#2:
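select col2, cat1, cat2, datetimestamp
      ,(COUNT(col2) / MAX(total)) as perc  -- a fraction; multiply by 100 for a percentage
from (
  -- datetimestamp must be selected here because the outer query groups by it
  select col2, cat1, cat2, datetimestamp
        ,COUNT(cat1) OVER (PARTITION BY col2) as total  -- rows per col2, repeated on every row
  from some_table
) x
group by cat1, cat2, col2, datetimestamp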
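The windowed variant can be tried out the same way. This is a sketch under a few assumptions: SQLite (3.25 or newer, for window function support) stands in for Hive, the rows are rebuilt from the count table in the question, and the result is scaled by 100.0 to give a percentage.

```python
import sqlite3

# Count table from the question: (col2, datetimestamp, category1, category2, count).
counts = [
    ("1101", "201408110525", "Arts and Entertainment", "Television", 3),
    ("1101", "201408110620", "Technology and Computing", "Internet Technology", 1),
    ("1101", "201408110705", "Technology and Computing", "Antivirus Software", 1),
    ("1107", "201408110510", "Business", "Advertising", 1),
    ("1107", "201408110520", "Business", "Marketing", 7),
    ("1109", "201408110505", "Technology and Computing", "Web Search", 1),
]
# Expand each combination back into its original duplicate rows.
rows = [c[:4] for c in counts for _ in range(c[4])]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE temp_table "
    "(col2 TEXT, datetimestamp TEXT, category1 TEXT, category2 TEXT)"
)
conn.executemany("INSERT INTO temp_table VALUES (?, ?, ?, ?)", rows)

# The window function attaches the per-col2 row total to every row,
# so no self-join is needed before grouping.
query = """
SELECT col2, datetimestamp, category1, category2,
       100.0 * COUNT(*) / MAX(total) AS percentage
FROM (
    SELECT col2, datetimestamp, category1, category2,
           COUNT(*) OVER (PARTITION BY col2) AS total
    FROM temp_table
) AS x
GROUP BY col2, datetimestamp, category1, category2
ORDER BY col2, datetimestamp
"""
window_result = conn.execute(query).fetchall()
for row in window_result:
    print(row)
```

Both variants should return the same percentages; the windowed one reads the table once instead of twice.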