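I have a table of sample CDR data in which column A and column B hold the calling and called mobile numbers. I need to find the person who made the most calls (column A), and also which number (column B) was called the most.
The table structure is as follows:
In the table above, 889578226 has the most outgoing calls and 77382596 is the most-called number. This is the kind of output I need to get.
In Hive, I run it as follows: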
SELECT calling_a,called_b, COUNT(called_b) FROM cdr_data GROUP BY calling_a,called_b;
What would be the equivalent Pig code for the above query?
Answer 0 (score: 0)
input.txt
a,100
a,101
a,101
a,101
a,103
b,200
b,201
b,201
c,300
c,300
c,301
d,400
PigScript:
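-- Load the comma-separated input as (name, phone) pairs.
A = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray,phone:long);
-- Count how many times each (name, phone) pair occurs.
B = GROUP A BY (name,phone);
C = FOREACH B GENERATE FLATTEN(group),COUNT(A) AS cnt;
-- For each name, keep only the phone with the highest count.
D = GROUP C BY $0;
E = FOREACH D {
    SortedList = ORDER C BY cnt DESC;
    top = LIMIT SortedList 1;
    GENERATE FLATTEN(top);
}
DUMP E;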
Output:
(a,101,3)
(b,201,2)
(c,300,2)
(d,400,1)
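The script above returns the most-called number for each caller. If what is needed is the single caller with the most calls and the single most-called number overall, as the question describes, a minimal sketch could look like the following. It assumes the same comma-separated input.txt, reuses the question's column names calling_a and called_b, and introduces its own relation names (cdr, by_caller, top_caller and so on) purely for illustration. When counts are tied, LIMIT keeps only one of the tied rows.

```pig
-- Assumption: input.txt holds one call per line as "caller,called".
cdr = LOAD 'input.txt' USING PigStorage(',') AS (calling_a:chararray, called_b:chararray);

-- Total calls made by each caller, highest first, keep the top row.
by_caller     = GROUP cdr BY calling_a;
caller_counts = FOREACH by_caller GENERATE group AS calling_a, COUNT(cdr) AS calls;
caller_sorted = ORDER caller_counts BY calls DESC;
top_caller    = LIMIT caller_sorted 1;

-- Total calls received by each number, highest first, keep the top row.
by_callee     = GROUP cdr BY called_b;
callee_counts = FOREACH by_callee GENERATE group AS called_b, COUNT(cdr) AS calls;
callee_sorted = ORDER callee_counts BY calls DESC;
top_callee    = LIMIT callee_sorted 1;

DUMP top_caller;
DUMP top_callee;
```

On the sample input.txt above, this should print (a,5) for the top caller and (101,3) for the most-called number.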