GROUP BY, filter, and order data

Time: 2016-12-10 02:42:20

Tags: apache-pig

I have data in the following format:

(id_type:chararray,id:long,date_cnt_bag:{(date:chararray,count:long)})

There are only 4 distinct ID types:

id_type: A/B/C/D

I want to convert this data into the following format:

(id, date, A_count, B_count, C_count, D_count)

E.g., given this input:

A 1 {(20161209, 100),(20161208, 90),(20161207, 80)}
B 1 {(20161209, 1000),(20161208, 900),(20161207, 800)}
C 1 {(20161209, 100),(20161208, 90)}
D 1 {(20161209, 10),(20161208, 9),(20161207, 8)}
A 2 {(20161209, 100),(20161208, 90),(20161207, 80)}
B 2 {(20161209, 1000),(20161207, 800)}
C 2 {(20161209, 100),(20161208, 90),(20161207, 80)}
D 2 {(20161209, 10),(20161208, 9),(20161207, 8)}

The output should look like the following. Note that if the count for a date is missing, I want to put 0 in its place.

1 20161209 (100 1000 100 10)
1 20161208 (90 900 90 9)
1 20161207 (80 800 0 8)
2 20161209 (100 1000 100 10)
2 20161208 (90 0 90 9)
2 20161207 (80 800 80 8)

I have searched for possible solutions and hints, but I am not getting anywhere. Thanks in advance.

2 answers:

Answer 0: (score: 1)

-- The load path and delimiter are left elided, as in the original answer.
raw = LOAD '...' using PigStorage(...) as (id_type:chararray,id:long,date_cnt_bag:{(date:chararray,count:long)});

-- One row per (id_type, id, date, count).
flattened = foreach raw generate id_type, id, flatten(date_cnt_bag);

final = foreach (group flattened by (id, date)) {
  -- Within each (id, date) group, keep the rows of type A and project their count.
  A_count_g = filter flattened by id_type == 'A';
  A_count = foreach A_count_g generate count as a_cnt;
  -- ... same for all 4 ...
  generate
    group.id,
    group.date,
    flatten(A_count)
  ;
};

Answer 1: (score: 1)

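@Vinay Here is a script that fills in 0 and does not lose the rows where one of the types has no data, such as (2,20161208,90,0,90,9). Without the IsEmpty check, flattening an empty bag emits no rows, so those rows would be missing from the output.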

raw = LOAD '/user/data/grp_ABCD.txt' using PigStorage('\t') as (id_type:chararray,id:long,date_cnt_bag:{(date:chararray,count:long)});

-- One row per (id_type, id, date, count).
flattened = foreach raw generate id_type, id, flatten(date_cnt_bag);

final = foreach (group flattened by (id, date)) {
  -- For each type, keep its rows within the (id, date) group and project the count.
  A_count_g = filter flattened by id_type == 'A';
  A_count = foreach A_count_g generate count as a_cnt;
  B_count_g = filter flattened by id_type == 'B';
  B_count = foreach B_count_g generate count as b_cnt;
  C_count_g = filter flattened by id_type == 'C';
  C_count = foreach C_count_g generate count as c_cnt;
  D_count_g = filter flattened by id_type == 'D';
  D_count = foreach D_count_g generate count as d_cnt;

  -- Substitute a one-tuple bag holding 0 when a type has no row for this (id, date).
  generate
    group.id,
    group.date,
    flatten((IsEmpty(A_count)  ?  {((long)0)} : A_count )),
    flatten((IsEmpty(B_count)  ?  {((long)0)} : B_count )),
    flatten((IsEmpty(C_count)  ?  {((long)0)} : C_count )),
    flatten((IsEmpty(D_count)  ?  {((long)0)} : D_count ))
  ;
};
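
Neither script sorts its result, while the listing in the question is ordered by id and then by date, newest first. As a possible alternative (a sketch, not something either answer proposes), the zero fill can also be done with a conditional expression per type followed by SUM, with an ORDER BY at the end. The relation names marked, by_id_date, pivoted and ordered are made up for illustration, and the path and tab delimiter are simply reused from the script above.

```pig
-- Sketch only: path and delimiter are copied from the answer above.
raw = LOAD '/user/data/grp_ABCD.txt' USING PigStorage('\t')
      AS (id_type:chararray, id:long, date_cnt_bag:{(date:chararray, count:long)});

-- One row per (id_type, id, date, count).
flattened = FOREACH raw GENERATE id_type, id, FLATTEN(date_cnt_bag) AS (date, count);

-- Put each count into the column of its own type; the other three columns get 0.
marked = FOREACH flattened GENERATE
    id,
    date,
    (id_type == 'A' ? count : 0L) AS a_cnt,
    (id_type == 'B' ? count : 0L) AS b_cnt,
    (id_type == 'C' ? count : 0L) AS c_cnt,
    (id_type == 'D' ? count : 0L) AS d_cnt;

-- Collapse the per-type rows of each (id, date) pair into a single row.
by_id_date = GROUP marked BY (id, date);

pivoted = FOREACH by_id_date GENERATE
    FLATTEN(group) AS (id, date),
    SUM(marked.a_cnt) AS a_cnt,
    SUM(marked.b_cnt) AS b_cnt,
    SUM(marked.c_cnt) AS c_cnt,
    SUM(marked.d_cnt) AS d_cnt;

-- date is a yyyymmdd string, so a descending string sort is newest first.
ordered = ORDER pivoted BY id ASC, date DESC;

DUMP ordered;
```

For the sample input in the question, this should yield rows such as (1,20161207,80,800,0,8). One difference in behavior to be aware of: if the same (id_type, id, date) combination occurs more than once, SUM adds the counts together, whereas the nested filter and flatten approach would emit one output row per duplicate.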