I have a list like the one below.
from to duration
5 10 1
10 30 15
10 30 25
5 10 10
10 40 15
5 20 5
I need to find the most frequently occurring from-to pairs, like this:
from to count
10 30 2
5 10 2
I have grouped the rows by (from, to), and I can get the counts shown below.
10 30 2
10 40 1
5 20 1
5 10 2
How can I extract only the pairs with the maximum count?
a = load 'x' using PigStorage;
b = group a by (from, to);
c = foreach b {
d = COUNT(c);
generate group, d;};
e = group d all;
f = foreach e {
g = order e by d;
h = limit g 1;
generate group, h; };
Answer 0 (score: 1)
Can you try this and let me know whether it works for you?
Update:
If your Pig version does not have the RANK operator, download piggybank.jar, add it to the classpath, and try the approach below.
input.txt
5 10 1
10 30 15
10 30 25
5 10 10
10 40 15
5 20 5
PigScript (Pig version < 0.11):
REGISTER /tmp/piggybank.jar;
DEFINE MyOver org.apache.pig.piggybank.evaluation.Over('myrank:int');
DEFINE MyStitch org.apache.pig.piggybank.evaluation.Stitch;
A = LOAD 'input.txt' AS (from,to,duration);
B = GROUP A BY (from,to);
C = FOREACH B {
    -- count the rows in each (from, to) group
    mycount = COUNT($1);
    GENERATE group, mycount AS cnt;
}
D = GROUP C ALL;
E = FOREACH D {
    -- sort all groups by count, highest first
    mysort = ORDER C BY cnt DESC;
    -- stitch a dense rank onto every row of the sorted bag
    GENERATE FLATTEN(MyStitch(mysort,MyOver(mysort,'dense_rank',0,1,1)));
};
F = FILTER E BY stitched::myrank==1;
G = FOREACH F GENERATE FLATTEN(stitched::group),stitched::cnt;
DUMP G;
Output:
(5,10,2)
(10,30,2)
PigScript (Pig version >= 0.11, which supports the RANK operator):
A = LOAD 'input.txt' AS (from,to,duration);
B = GROUP A BY (from,to);
C = FOREACH B {
    -- count the rows in each (from, to) group
    mycount = COUNT($1);
    GENERATE group, mycount AS cnt;
}
D = RANK C BY cnt DESC;
E = FILTER D BY rank_C==1;
F = FOREACH E GENERATE FLATTEN(group),cnt;
DUMP F;
Output:
(5,10,2)
(10,30,2)
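For readers who want to check the logic outside Pig, here is a minimal Python sketch of the same idea: count each (from, to) pair, rank the counts, and keep every pair at rank 1. It is not part of the Pig scripts above. The function name top_pairs_by_dense_rank is hypothetical, and the rows are simply copied from input.txt. Ties at the top all receive rank 1, which is why both pairs survive the filter.

```python
from collections import Counter

# Rows copied from input.txt above: (from, to, duration).
rows = [
    (5, 10, 1),
    (10, 30, 15),
    (10, 30, 25),
    (5, 10, 10),
    (10, 40, 15),
    (5, 20, 5),
]


def top_pairs_by_dense_rank(rows):
    """Return every (from, to, count) whose count has dense rank 1.

    Hypothetical helper for illustration only.
    """
    # Group by (from, to) and count the rows in each group.
    counts = Counter((src, dst) for src, dst, _duration in rows)
    # Dense rank: distinct counts sorted descending, largest gets rank 1.
    distinct_counts = sorted(set(counts.values()), reverse=True)
    rank_of = {cnt: rank for rank, cnt in enumerate(distinct_counts, start=1)}
    # Keep only the groups whose count is ranked first.
    return [
        (src, dst, cnt)
        for (src, dst), cnt in counts.items()
        if rank_of[cnt] == 1
    ]


print(sorted(top_pairs_by_dense_rank(rows)))  # [(5, 10, 2), (10, 30, 2)]
```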
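Answer 1 (score: 0)
The answer above will definitely work. I thought of writing the logic like this instead, although the code here is a bit lengthy.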
A = LOAD 'input.txt' AS (from,to,duration);
B = GROUP A BY (from,to);
C = FOREACH B GENERATE FLATTEN(group) AS(from,to),COUNT(A.from) as count;
D = ORDER C BY count DESC;
E = LIMIT D 1;
F = JOIN C by count,E BY $2;
G = FOREACH F GENERATE $0,$1,$2;
Please check whether you find it useful.
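The steps in this answer (sort by count, keep the first row, then join back on count to recover the ties) can be sketched in Python as follows. This is only an illustration under the assumption that the rows match input.txt. The name top_pairs_by_join is hypothetical, and the comments refer to the relation letters used in the script above.

```python
from collections import Counter

# Rows copied from input.txt: (from, to, duration).
rows = [
    (5, 10, 1),
    (10, 30, 15),
    (10, 30, 25),
    (5, 10, 10),
    (10, 40, 15),
    (5, 20, 5),
]


def top_pairs_by_join(rows):
    """Return all (from, to, count) rows that share the highest count.

    Hypothetical helper for illustration only.
    """
    # C: one (from, to, count) row per group.
    pair_counts = Counter((src, dst) for src, dst, _duration in rows)
    counted = [(src, dst, cnt) for (src, dst), cnt in pair_counts.items()]
    # D and E: sort by count descending and keep only the first row.
    top = sorted(counted, key=lambda row: row[2], reverse=True)[:1]
    # F and G: join C back to that single row on count, keeping C's columns.
    return [c for c in counted for e in top if c[2] == e[2]]


print(sorted(top_pairs_by_join(rows)))  # [(5, 10, 2), (10, 30, 2)]
```

The join is what brings back the tied pairs: LIMIT 1 alone keeps a single row, so any other pair with the same count would be lost without it.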