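I have data in the following format:
(id_type:chararray,id:long,date_cnt_bag:{(date:chararray,count:long)})
There are only 4 distinct ID types:
id_type: A/B/C/D
I want to convert this data into the following format:
(id, date, A_count, B_count, C_count, D_count)
For example, given this input: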
A 1 {(20161209, 100),(20161208, 90),(20161207, 80)}
B 1 {(20161209, 1000),(20161208, 900),(20161207, 800)}
C 1 {(20161209, 100),(20161208, 90)}
D 1 {(20161209, 10),(20161208, 9),(20161207, 8)}
A 2 {(20161209, 100),(20161208, 90),(20161207, 80)}
B 2 {(20161209, 1000),(20161207, 800)}
C 2 {(20161209, 100),(20161208, 90),(20161207, 80)}
D 2 {(20161209, 10),(20161208, 9),(20161207, 8)}
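The output should look like the following. Note also that if the count for an ID type is missing on a given date, I want to put 0 there.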
1 20161209 (100 1000 100 10)
1 20161208 (90 900 90 9)
1 20161207 (80 800 0 8)
2 20161209 (100 1000 100 10)
2 20161208 (90 0 90 9)
2 20161207 (80 800 80 8)
I have searched for possible solutions and hints, but I am not getting anywhere. Thanks in advance.
Answer 0 (score: 1)
raw = LOAD '...' using PigStorage(...) as (id_type:chararray,id:long,date_cnt_bag:{(date:chararray,count:long)});
-- One row per (id_type, id, date, count).
flattened = foreach raw generate id_type, id, flatten(date_cnt_bag);
final = foreach (group flattened by (id, date)) {
    -- Within each (id, date) group, pick out the count for type A.
    A_count_g = filter flattened by id_type == 'A';
    A_count = foreach A_count_g generate count as a_cnt;
    -- ... same for all 4 (B, C and D) ...
    generate
        group.id,
        group.date,
        flatten(A_count)  -- followed by the flattened B, C and D counts
    ;
};
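Answer 1 (score: 1)
@Vinay, here is a script that gives you 0 for missing counts and does not lose rows with missing information, e.g. (2,20161208,90,0,90,9). Without the IsEmpty check, flattening an empty bag drops the whole row.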
raw = LOAD '/user/data/grp_ABCD.txt' using PigStorage('\t') as (id_type:chararray,id:long,date_cnt_bag:{(date:chararray,count:long)});
-- One row per (id_type, id, date, count).
flattened = foreach raw generate id_type, id, flatten(date_cnt_bag);
final = foreach (group flattened by (id, date)) {
    -- For each ID type, keep only that type's count within the (id, date) group.
    A_count_g = filter flattened by id_type == 'A';
    A_count = foreach A_count_g generate count as a_cnt;
    B_count_g = filter flattened by id_type == 'B';
    B_count = foreach B_count_g generate count as b_cnt;
    C_count_g = filter flattened by id_type == 'C';
    C_count = foreach C_count_g generate count as c_cnt;
    D_count_g = filter flattened by id_type == 'D';
    D_count = foreach D_count_g generate count as d_cnt;
    -- Substitute a one-tuple bag holding 0 for an empty bag so the row survives flatten.
    generate
        group.id,
        group.date,
        flatten((IsEmpty(A_count) ? {((long)0)} : A_count )),
        flatten((IsEmpty(B_count) ? {((long)0)} : B_count )),
        flatten((IsEmpty(C_count) ? {((long)0)} : C_count )),
        flatten((IsEmpty(D_count) ? {((long)0)} : D_count ))
    ;
};
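A possible variation on the same idea, offered only as a sketch and not taken from either answer: instead of filtering inside the nested block, each row can be projected into four per-type columns with a conditional expression and then summed per (id, date). The aliases tagged, grouped and pivoted and the field name cnt are made up for illustration, and the load path is simply the one used above. Unlike the script above, this adds up duplicate (id_type, id, date) entries rather than emitting one row per combination, so it assumes each such entry appears at most once.

```pig
-- Sketch only: same input path and schema as the script above.
raw = LOAD '/user/data/grp_ABCD.txt' USING PigStorage('\t')
      AS (id_type:chararray, id:long, date_cnt_bag:{(date:chararray, count:long)});

-- One row per (id_type, id, date, cnt); cnt is a hypothetical rename of count.
flattened = FOREACH raw GENERATE id_type, id,
            FLATTEN(date_cnt_bag) AS (date:chararray, cnt:long);

-- Put the count in the column matching its type and 0 in the other three.
tagged = FOREACH flattened GENERATE id, date,
         (id_type == 'A' ? cnt : 0L) AS a_cnt,
         (id_type == 'B' ? cnt : 0L) AS b_cnt,
         (id_type == 'C' ? cnt : 0L) AS c_cnt,
         (id_type == 'D' ? cnt : 0L) AS d_cnt;

-- Collapse to one row per (id, date); types with no entry stay at 0.
grouped = GROUP tagged BY (id, date);
pivoted = FOREACH grouped GENERATE
          FLATTEN(group) AS (id, date),
          SUM(tagged.a_cnt) AS A_count,
          SUM(tagged.b_cnt) AS B_count,
          SUM(tagged.c_cnt) AS C_count,
          SUM(tagged.d_cnt) AS D_count;
```

Because every row contributes a 0 to the columns of the other types, no IsEmpty check is needed here.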