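Sort tuples in a bag based on multiple fields

Time: 2015-10-19 09:58:18

Tags: apache-pig

I am trying to sort the tuples in a bag in descending order based on three fields.

Example: suppose I have the following bag, created by grouping: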

{(s,3,my),(w,7,pr),(q,2,je)}

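I want to sort the tuples in the grouped bag above on fields $0, $1 and $2. The sort should first compare $0 across all tuples, so that the tuple with the largest $0 value is selected. If $0 is the same for all tuples, it should then sort by $1, and so on.

The bag of every grouped row should be sorted this way, by iterating over the rows.

Suppose we have a databag like this: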

{(21,25,34),(21,28,64),(21,25,52)}

Then the required output is:

{(21,25,34),(21,25,52),(21,28,64)}

Please let me know if you need more clarification.

1 Answer:

Answer 0: (score: 1)

Order your tuples inside a nested FOREACH. This will work.

Input:

(1,s,3,my)
(1,w,7,pr)
(1,q,2,je)


A = LOAD 'file' USING PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
D = FOREACH C {
 od = ORDER A BY b, c, d;
 GENERATE od;
 };

Result of DUMP C (similar to your data):

({(1,s,3,my),(1,w,7,pr),(1,q,2,je)})

Output (DUMP D):

({(1,q,2,je),(1,s,3,my),(1,w,7,pr)})

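This works in all cases, because the nested ORDER is applied to the bag of each group separately.

To generate the tuple with the highest value: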
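One thing to watch: the script above loads b, c and d as chararray, so ORDER compares them as strings (for example, "10" sorts before "9"). If the fields are numeric, as in the question's (21,25,34) bag, a typed schema is probably what you want. The following is only a sketch under that assumption; the int types and the extra `group` column in the output are my additions, while the file name and column layout are taken from the script above.

```pig
-- Assumption: columns b, c and d hold integers (not stated in the original answer).
A = LOAD 'file' USING PigStorage(',') AS (a:chararray, b:int, c:int, d:int);
B = GROUP A BY a;
D = FOREACH B {
    -- Sort each group's bag on b, then c, then d.
    -- ORDER is ascending by default; append DESC to a field for descending order.
    od = ORDER A BY b, c, d;
    GENERATE group, od;
};
DUMP D;
```

Keeping `group` in the GENERATE makes it easier to see which key each sorted bag belongs to.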

A = LOAD 'file' USING PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
D = FOREACH C {
 od = ORDER A BY b DESC, c DESC, d DESC;
 od1 = LIMIT od 1;
 GENERATE od1;
 };
DUMP D;
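In the script above, od1 is still a bag that holds a single tuple. If a flat record per group is more convenient, one option is to FLATTEN that bag. This is a sketch of that variation rather than part of the original answer; it reuses the same file name and schema and groups directly on B.

```pig
-- Same LOAD and GROUP as in the answer above.
A = LOAD 'file' USING PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
D = FOREACH B {
    od = ORDER A BY b DESC, c DESC, d DESC;
    od1 = LIMIT od 1;
    -- FLATTEN unwraps the one-tuple bag, giving one flat record per group.
    GENERATE group, FLATTEN(od1);
};
DUMP D;
```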

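To generate the tuple with the highest value when all three fields differ, and to return all the tuples when all of them are identical or when field 1 and field 2 are the same: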

A = LOAD 'file' USING PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
F = RANK C; -- rank used to separate out the value if two tuples are the same
R = FOREACH F {
 dis = DISTINCT A;
 GENERATE rank_C, COUNT(dis) AS (cnt:long), A;
 };
R3 = FILTER R BY cnt!=1; -- filter out the case where all the tuples are the same
R4 = FOREACH R3 {
 fil1 = ORDER A BY b DESC, c DESC, d DESC;
 fil2 = LIMIT fil1 1;
 GENERATE rank_C, fil2;
 }; -- find the largest tuple, except when all the tuples are the same
R5 = FILTER R BY cnt==1; -- only the case where all the tuples are the same
R6 = FOREACH R5 GENERATE A; -- generate required fields
F1 = FOREACH F GENERATE rank_C, FLATTEN(A);
F2 = GROUP F1 BY (rank_C, A::b, A::c); -- group by field 1, field 2
F3 = FOREACH F2 GENERATE COUNT(F1) AS (cnt1:long), F1; -- if count = 2 then the tuples are the same on field 1 and field 2
F4 = FILTER F3 BY cnt1==2; -- separate those out
F5 = FOREACH F4 {
 DIS = DISTINCT F1;
 GENERATE FLATTEN(DIS);
 };
F8 = JOIN F BY rank_C, F5 BY rank_C;
F9 = FOREACH F8 GENERATE F::A;
Z = CROSS R4, F5; -- cross done to generate the case where all the tuples are different
Z1 = FILTER Z BY R4::rank_C!=F5::DIS::rank_C;
Z2 = FOREACH Z1 GENERATE FLATTEN(R4::fil2);
res = UNION Z2, R6, F9;
-- Z2 - holds the highest-value tuple when all three fields in the tuples are different
-- R6 - holds the tuples when all three fields in the tuples are the same
-- F9 - holds the tuples when two fields of the tuples are the same
DUMP res;