I ran into a problem using LIMIT in Pig. The output of LIMIT comes back sorted, but I don't want the result to be sorted.
Here is the example from the website:
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
Using LIMIT:
X = LIMIT A 3;
DUMP X;
(1,2,3)
(4,3,3)
(7,2,5)
Is it possible to get the first 3 rows as they appear in the input, without the result being sorted?
(1,2,3)
(4,2,1)
(8,3,4)
My code is as follows:
A = LOAD '$input';
B = foreach A generate $s_field;
C = FILTER B BY $pattern;
D = FOREACH C {
topnresult = LIMIT B $lines;
GENERATE FLATTEN(topnresult);
}
dump D;
Thanks a lot.
Answer 0 (score: 1)
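LIMIT on its own makes no guarantee about which tuples are returned or in what order; according to the Pig documentation, the result is only well defined when LIMIT follows an ORDER. That is why the rows you get back are not the first three rows of the input. There are several ways around this; one option is: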
input.txt:
1 2 3
4 2 1
8 3 4
4 3 3
7 2 5
8 4 3
PigScript:
A = LOAD 'input.txt' AS (a1:int,a2:int,a3:int);
B = RANK A;
C = FILTER B BY rank_A<=3;
D = FOREACH C GENERATE a1,a2,a3;
DUMP D;
Output:
(1,2,3)
(4,2,1)
(8,3,4)
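To see why the filter on rank_A works, it can help to inspect the relation that RANK produces. The lines below are a minimal sketch, assuming the same input.txt as above, tab-delimited (the default for LOAD) and read as a single file in local mode. RANK with no field list prepends a sequential row number, and Pig names that field rank_ followed by the alias of the ranked relation.

```pig
-- Sketch only: look at what RANK adds before filtering on it.
-- Assumes input.txt is tab-delimited and read as a single file.
A = LOAD 'input.txt' AS (a1:int,a2:int,a3:int);
B = RANK A;
DESCRIBE B;  -- the schema now starts with a field named rank_A
DUMP B;      -- each tuple gains a leading row number, e.g. (1,1,2,3) for the first row
```

Because the row number follows input order here, keeping rank_A<=3 and then projecting away the rank field yields the first three rows unsorted.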
Option 2:
A = LOAD 'input.txt' AS (a1:int,a2:int,a3:int);
B = GROUP A ALL;
C = FOREACH B {
top3list = LIMIT A 3;
GENERATE FLATTEN(top3list);
}
DUMP C;
Output:
(1,2,3)
(4,2,1)
(8,3,4)
Update, applied to your script. Solution 1:
A = LOAD '$input';
B = foreach A generate $s_field;
C = FILTER B BY $pattern;
D = GROUP C ALL;
E = FOREACH D {
topnresult = LIMIT C $lines;
GENERATE FLATTEN(topnresult);
}
DUMP E;
Solution 2:
A = LOAD '$input';
B = foreach A generate $s_field;
C = FILTER B BY $pattern;
D = RANK C;
E = FILTER D BY rank_C<=$lines;
F = FOREACH E GENERATE $1..;
DUMP F;
I tested both solutions with the following command line, and they work fine:
>pig -x local -param input='input.txt' -param s_field='$0,$1,$2' -param pattern='$0<10' -param lines=3 myscript.pig
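For reference, this is roughly what Solution 2 looks like once Pig has substituted the parameters from that command line. It is written out by hand as an illustration rather than copied from Pig's output, so treat it as a sketch. Note that the `$0`, `$1`, `$2` that remain after substitution are positional field references, not parameters.

```pig
-- Solution 2 with input='input.txt', s_field='$0,$1,$2', pattern='$0<10', lines=3
-- (hand-expanded sketch; the remaining $N are positional field references)
A = LOAD 'input.txt';
B = foreach A generate $0,$1,$2;
C = FILTER B BY $0<10;
D = RANK C;                    -- prepends the row number as field rank_C
E = FILTER D BY rank_C<=3;     -- keep the first three rows in input order
F = FOREACH E GENERATE $1..;   -- drop the rank field, keep everything after it
DUMP F;
```

The single quotes around the parameter values on the command line keep the shell from expanding `$0` and friends before Pig sees them.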