I am trying to sort the tuples in a bag in descending order on three fields.
Example: suppose that grouping gives me the following bag:
{(s,3,my),(w,7,pr),(q,2,je)}
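I want to sort the tuples in the grouped bag above on fields $0, $1 and $2. The sort first compares $0 across all tuples and picks the tuple with the largest $0. If $0 is the same for all tuples, it then sorts on $1, and so on.
The bag of every grouped row should be sorted this way, one row at a time.
Suppose we have a databag like this: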
{(21,25,34),(21,28,64),(21,25,52)}
Then the required output would be:
{(21,25,34),(21,25,52),(21,28,64)}
Let me know if you need more clarification.
Answer 0 (score: 1)
Order your tuples inside a nested FOREACH. This will work.
Input:
(1,s,3,my)
(1,w,7,pr)
(1,q,2,je)
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
D = FOREACH C {
    od = ORDER A BY b, c, d;
    GENERATE od;
};
Result of DUMP C (similar to your data):
({(1,s,3,my),(1,w,7,pr),(1,q,2,je)})
Output of DUMP D:
({(1,q,2,je),(1,s,3,my),(1,w,7,pr)})
This works for every grouped bag.
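One caveat, as a sketch rather than part of the original answer: the script above declares every field as chararray, so ORDER compares the values as strings and, for example, 10 sorts before 9. For numeric bags such as the {(21,25,34),(21,28,64),(21,25,52)} example in the question, the fields could be declared as int instead. This assumes every column really holds integers; the path 'file' and the aliases are carried over from the script above.

```pig
-- Assumption: all four columns hold integers, so they are declared as int
-- instead of chararray and ORDER compares them numerically.
A = LOAD 'file' USING PigStorage(',') AS (a:int, b:int, c:int, d:int);
B = GROUP A BY a;
D = FOREACH B {
    -- sort each group's bag on b, then c, then d (ascending by default)
    od = ORDER A BY b, c, d;
    GENERATE od;
};
DUMP D;
```

Appending DESC to each key in the ORDER clause reverses the order, which is what the descending sort in the question asks for.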
To generate only the tuple with the highest value:
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
D = FOREACH C {
    od = ORDER A BY b desc, c desc, d desc;
    od1 = LIMIT od 1;
    GENERATE od1;
};
dump D;
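If one flat row per group is more convenient than a one-element bag, a possible variation is to run the nested FOREACH directly on B and flatten the result. This is only a sketch under the same assumptions as the script above (same input file and schema); the alias best is introduced here for illustration.

```pig
A = LOAD 'file' USING PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
D = FOREACH B {
    od = ORDER A BY b DESC, c DESC, d DESC;  -- largest b first, ties broken by c, then d
    best = LIMIT od 1;                       -- keep only the first (largest) tuple
    GENERATE FLATTEN(best);                  -- emit it as a plain row instead of a bag
};
DUMP D;
```

Because GROUP already names the inner bag A, the intermediate relation C is not needed in this variation.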
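To generate the tuple with the highest value when all three fields differ, and to return all tuples when all tuples are identical or when field 1 and field 2 are the same: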
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
F = RANK C; -- rank is used to tell the rows apart when two tuples are the same
R = FOREACH F {
    dis = distinct A;
    GENERATE rank_C,COUNT(dis) AS (cnt:long),A;
};
R3 = FILTER R BY cnt!=1; -- drop the rows in which all the tuples are the same
R4 = FOREACH R3 {
    fil1 = ORDER A by b desc, c desc, d desc;
    fil2 = LIMIT fil1 1;
    GENERATE rank_C,fil2;
}; -- find the largest tuple, except where all the tuples are the same
R5 = FILTER R BY cnt==1; -- keeps only the rows in which all the tuples are the same
R6 = FOREACH R5 GENERATE A ; -- generate the required fields
F1 = FOREACH F GENERATE rank_C,FLATTEN(A);
F2 = GROUP F1 BY (rank_C, A::b, A::c); -- group by field 1 and field 2
F3 = FOREACH F2 GENERATE COUNT(F1) AS (cnt1:long) ,F1; -- a count of 2 means two tuples share field 1 and field 2
F4 = FILTER F3 BY cnt1==2; -- keep only those
F5 = FOREACH F4 {
    DIS = distinct F1;
    GENERATE flatten(DIS);
};
F8 = JOIN F BY rank_C, F5 by rank_C;
F9 = FOREACH F8 GENERATE F::A;
Z = cross R4,F5; -- cross is used to produce the result when all the tuples are different
Z1 = FILTER Z BY R4::rank_C!=F5::DIS::rank_C;
Z2 = FOREACH Z1 GENERATE FLATTEN(R4::fil2);
res = UNION Z2,R6,F9; -- Z2: the highest-valued tuple when all three fields of the tuples differ
-- R6: the tuples when all of them are the same
-- F9: the tuples when two fields of the tuples are the same
dump res;