Apache Pig: ordering after grouping

Time: 2014-12-18 03:21:51

Tags: apache-pig

I have a list as follows.

from    to  duration 
5       10  1
10      30  15
10      30  25
5       10  10
10      40  15
5       20  5

I need to find the most frequently occurring from-to pairs, as shown below.

from    to  count 
10      30      2
5       10      2

I have grouped them by (from, to) and can get the counts shown below.

10  30  2
10  40  1
5   20  1
5   10  2

How can I extract only the pairs with the maximum count? This is what I have tried:

    a = load 'x' using PigStorage;
    b = group a by (from, to);
    c = foreach b {
        d = COUNT(c);
        generate group, d;};
    e = group d all;
    f = foreach e {
        g = order e by d;
        h = limit g 1;
        generate group, h; };

2 answers:

Answer 0: (score: 1)

You can try the following and let me know whether it works for you.

Update:
If you don't have the RANK operator, download piggybank.jar, add it to the classpath, and try the approach below.

input.txt

5       10      1
10      30      15
10      30      25
5       10      10
10      40      15
5       20      5

PigScript: Pig version < 0.11

    REGISTER /tmp/piggybank.jar;

    -- Over applies a window function to a bag; Stitch attaches its result to the bag's tuples.
    DEFINE MyOver org.apache.pig.piggybank.evaluation.Over('myrank:int');
    DEFINE MyStitch org.apache.pig.piggybank.evaluation.Stitch;

    A = LOAD 'input.txt' AS (from,to,duration);
    -- Count the rows for each (from,to) pair.
    B = GROUP A BY (from,to);
    C = FOREACH B{
                    mycount = COUNT($1);
                    GENERATE group, mycount AS cnt;
                 }
    -- Collect all counts into one bag, sort descending, and add a dense rank.
    D = GROUP C ALL;
    E = FOREACH D  {
                      mysort = ORDER C BY cnt DESC;
                      GENERATE FLATTEN(MyStitch(mysort,MyOver(mysort,'dense_rank',0,1,1)));
                   };
    -- Rank 1 holds every pair that shares the highest count.
    F = FILTER E BY stitched::myrank==1;
    G = FOREACH F GENERATE FLATTEN(stitched::group),stitched::cnt;
    DUMP G;

Output:

(5,10,2)
(10,30,2)

PigScript: Pig version >= 0.11 (supports the RANK operator)

    A = LOAD 'input.txt' AS (from,to,duration);
    B = GROUP A BY (from,to);
    C = FOREACH B{
                    mycount = COUNT($1);
                    GENERATE group, mycount AS cnt;
                 }
    D = RANK C BY cnt DESC;
    E = FILTER D BY rank_C==1;
    F = FOREACH E GENERATE FLATTEN(group),cnt;
    DUMP F;

Output:

(5,10,2)
(10,30,2)

Answer 1: (score: 0)

The answer above will certainly work. I thought of writing the logic as follows, although the code is somewhat long.

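    A = LOAD 'input.txt' AS (from,to,duration);
    B = GROUP A BY (from,to);
    -- One row per pair: from, to, and the number of occurrences.
    C = FOREACH B GENERATE FLATTEN(group) AS(from,to),COUNT(A.from) as count;
    -- E holds a single row carrying the highest count.
    D = ORDER C BY count DESC;
    E = LIMIT D 1;
    -- Joining on the count keeps every pair tied for the maximum.
    F = JOIN C by count,E BY $2;
    G = FOREACH F GENERATE $0,$1,$2;
    DUMP G;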

Take a look and see whether you find it useful.
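
The same idea can probably be written without the ORDER, LIMIT, and JOIN steps by computing the maximum count directly with the built-in MAX function. The following is only a sketch under a few assumptions: the input is tab-delimited, the columns are integers, and the field names src, dst, cnt, and max_cnt are hypothetical names chosen for illustration (src and dst stand in for the question's from and to columns).

```pig
-- Assumption: tab-delimited input with three integer columns.
-- src/dst are illustrative names for the question's from/to columns.
A = LOAD 'input.txt' AS (src:int, dst:int, duration:int);

-- Count the rows for each (src, dst) pair.
B = GROUP A BY (src, dst);
C = FOREACH B GENERATE FLATTEN(group) AS (src, dst), COUNT(A) AS cnt;

-- Put all per-pair counts into one bag so the maximum can be computed.
D = GROUP C ALL;

-- Re-expand the bag and attach the maximum count to every row.
E = FOREACH D GENERATE FLATTEN(C) AS (src, dst, cnt), MAX(C.cnt) AS max_cnt;

-- Keep only the pairs whose count equals the maximum (ties are kept).
F = FILTER E BY cnt == max_cnt;
G = FOREACH F GENERATE src, dst, cnt;
DUMP G;
```

For the sample data in the question this should keep the pairs (5,10) and (10,30), each with a count of 2. Like the other scripts here, GROUP ... ALL funnels every per-pair count through a single group, which is fine for a modest number of distinct pairs.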