Objective: use each group's unique key as the folder name and write the contents of its bag as records.
File : employee.txt
#JoiningDate Employee Id Employee Name
20140302 1 A
20140302 2 B
20140302 3 C
20140303 4 D
20140303 5 E
20140303 6 F
Pig script:
X = load 'employee.txt' using PigStorage('\t') as (joining_date:chararray, employee_id:long, employee_name:chararray);
Y = group X by joining_date;
The output of this (Y) would be:
(20140302, {(20140302,1,A), (20140302,2,B), (20140302,3,C)})
(20140303, {(20140303,4,D), (20140303,5,E), (20140303,6,F)})
The goal is to have two folders in the output path:
1. outputfolder/20140302 : having three records
20140302,1,A
20140302,2,B
20140302,3,C
2. outputfolder/20140303 : having three records
20140303,4,D
20140303,5,E
20140303,6,F
What I tried:
store Y into 'outputfolder' using org.apache.pig.piggybank.storage.MultiStorage('outputfolder', '0', 'none', ',');
The result I see is:
1. outputfolder/20140302/20140302-0
(20140302, {(20140302,1,A), (20140302,2,B), (20140302,3,C)})
2. outputfolder/20140303/20140303-0
(20140303, {(20140303,4,D), (20140303,5,E), (20140303,6,F)})
Answer 0 (score: 1)
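One option is simply to flatten the bag before the store command, so that each row is a plain record again rather than a key plus a bag.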
X = load 'employee.txt' using PigStorage('\t') as (joining_date:chararray, employee_id:long, employee_name:chararray);
Y = group X by joining_date;
Z = FOREACH Y GENERATE FLATTEN($1);
store Z into 'outputfolder' using org.apache.pig.piggybank.storage.MultiStorage('outputfolder', '0', 'none', ',');
The output will be stored in the outputfolder/20140302 folder (and likewise in outputfolder/20140303), in files with names like 20140302-0,000
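
Since MultiStorage already splits records by the field index you pass it ('0' here), the group followed by flatten may not be needed at all for this case. Below is a minimal sketch that stores X directly, assuming joining_date stays the first field. The REGISTER path is a hypothetical placeholder and is not from the question; point it at wherever piggybank.jar lives in your installation.

```pig
-- Assumption: hypothetical jar location, adjust for your environment.
REGISTER /path/to/piggybank.jar;

X = LOAD 'employee.txt' USING PigStorage('\t')
    AS (joining_date:chararray, employee_id:long, employee_name:chararray);

-- Field index '0' (joining_date) picks the subfolder for each record;
-- 'none' means no compression and ',' is the output field delimiter.
STORE X INTO 'outputfolder'
    USING org.apache.pig.piggybank.storage.MultiStorage('outputfolder', '0', 'none', ',');
```

This should produce the same folder layout as the flattened version, with one subfolder per distinct joining_date. I have not run it against your data, so treat it as a starting point.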