Write each group's unique key as a folder name and the bag contents as records?

Time: 2015-04-15 09:56:53

Tags: apache-pig store

Goal: write each group's unique key as a folder name and the contents of its bag as records.

 File : employee.txt

 #JoiningDate   Employee Id     Employee Name
   20140302        1             A
   20140302        2             B
   20140302        3             C
   20140303        4             D
   20140303        5             E
   20140303        6             F

Pig script:

  X = load 'employee.txt' using PigStorage('\t') as (joining_date:chararray, employee_id:long, employee_name:chararray);

  Y =  group X by joining_date;

The output of this (Y) would be:

(20140302, {(20140302,1,A), (20140302,2,B), (20140302,3,C)})
(20140303, {(20140303,4,D), (20140303,5,E), (20140303,6,F)})

The goal is to end up with two folders in the output path:

    1. outputfolder/20140302 : having three records
            20140302,1,A
            20140302,2,B    
            20140302,3,C
    2. outputfolder/20140303 : having three records
            20140303,4,D
            20140303,5,E
            20140303,6,F

I tried:

 store Y into 'outputfolder' using org.apache.pig.piggybank.storage.MultiStorage('outputfolder', '0', 'none', ',');

The result I see is:

     1. outputfolder/20140302/20140302-0
            (20140302, {(20140302,1,A), (20140302,2,B), (20140302,3,C)})
     2. outputfolder/20140303/20140303-0
            (20140303, {(20140303,4,D), (20140303,5,E), (20140303,6,F)})

1 answer:

Answer 0 (score: 1)

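One option might be simply to flatten the values before the store command. MultiStorage writes each tuple whole into the folder named after the split field, so a grouped tuple (key plus bag) ends up as a single record; flattening the bag restores one tuple per employee.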

X = load 'employee.txt' using PigStorage('\t') as (joining_date:chararray, employee_id:long, employee_name:chararray);
Y = group X by joining_date;
Z = FOREACH Y GENERATE FLATTEN($1);
store Z into 'outputfolder' using org.apache.pig.piggybank.storage.MultiStorage('outputfolder', '0', 'none', ',');

The output will be stored in the folder outputfolder/20140302 (and likewise outputfolder/20140303), in files whose names start like 20140302-0,000.
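
Because MultiStorage picks the target folder from a field of each tuple, the GROUP step may not be needed at all. The following is an untested sketch under that assumption, storing the loaded relation directly. The REGISTER line and the piggybank.jar path are placeholders that are not part of the original post.

```pig
-- Assumption: piggybank.jar sits in the working directory; adjust the path for your installation.
REGISTER piggybank.jar;

X = LOAD 'employee.txt' USING PigStorage('\t')
    AS (joining_date:chararray, employee_id:long, employee_name:chararray);

-- Field index '0' (joining_date) selects the subfolder under 'outputfolder';
-- each tuple is written as one comma-separated line, with no compression.
STORE X INTO 'outputfolder'
    USING org.apache.pig.piggybank.storage.MultiStorage('outputfolder', '0', 'none', ',');
```

If the grouped relation is needed elsewhere in the script, the FLATTEN approach above keeps it available. Otherwise storing X directly avoids the extra group-and-flatten pass.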