I have data that has already been grouped and aggregated. It looks like this:
user value count
---- -------- ------
Alice third 5
Alice first 11
Alice second 10
Alice fourth 2
...
Bob second 20
Bob third 18
Bob first 21
Bob fourth 8
...
For each user (Alice and Bob), I want to retrieve their top n values (say 2), ranked by 'count' in descending order. So the output I want is:
Alice first 11
Alice second 10
Bob first 21
Bob second 20
How can I achieve this?
Answer 0 (score: 28)
One approach is:
records = LOAD '/user/nubes/ncdc/micro-tab/top.txt' AS (user:chararray,value:chararray,counter:int);
grpd = GROUP records BY user;
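top2 = foreach grpd {
    -- sort each user's bag by counter, highest first
    sorted = order records by counter desc;
    -- keep only the first two tuples of the sorted bag
    top = limit sorted 2;
    generate group, flatten(top);
};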
The input is:
Alice third 5
Alice first 11
Alice second 10
Alice fourth 2
Bob second 20
Bob third 18
Bob first 21
Bob fourth 8
The output is:
(Alice,Alice,first,11)
(Alice,Alice,second,10)
(Bob,Bob,first,21)
(Bob,Bob,second,20)
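Answer 1 (score: 6)
I have just one observation about this line:
top = limit sorted 2;
TOP is a built-in function, so using top as an alias may throw an error. The only thing I did in this case was rename the relation. I also dropped group from the generate statement, because
generate group, flatten(top);
repeats the user name in every row and gave the output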
(Alice,Alice,first,11)
(Alice,Alice,second,10)
(Bob,Bob,first,21)
(Bob,Bob,second,20)
The modified script is shown below:
records = load 'test1.txt' using PigStorage(',') as (user:chararray, value:chararray, count:int);
grpd = GROUP records BY user;
top2 = foreach grpd {
    sorted = order records by count desc;
    top1 = limit sorted 2;
    generate flatten(top1);
};
This gave me the desired output:
(Alice,first,11)
(Alice,second,10)
(Bob,first,21)
(Bob,second,20)
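Since TOP is a built-in function, the same per-user selection can probably also be written with it directly. The following is only a minimal sketch, assuming the same comma-separated test1.txt input; the field name counter and the alias top2_builtin are my own choices for illustration. I have not checked the order in which TOP returns the rows within each user, so if that order matters, the order/limit version is the safer choice.

```pig
-- Sketch under the assumptions stated above, not a verified replacement.
records = load 'test1.txt' using PigStorage(',') as (user:chararray, value:chararray, counter:int);
grpd = GROUP records BY user;
-- TOP(n, column, bag): the column index is 0-based, so 2 refers to counter
top2_builtin = foreach grpd generate flatten(TOP(2, 2, records));
dump top2_builtin;
```

This avoids the nested order and limit steps, at the cost of referring to the column by position instead of by name.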
Hope this helps.