Hierarchical grouping in Pig

Time: 2016-03-22 19:44:56

Tags: apache-pig

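I have a dataset with a large number of fields and rows. I want to perform hierarchical grouping, but I can't figure out how to access the fields in the grouped dataset.

For example, suppose we have (id, firstname, lastname, age, phone, city).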

student_details = LOAD 'student_details.txt' USING PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray);
group_1 = GROUP student_details by (age,phone,id);
group_2 = GROUP group_1 by (group.age,group.phone);
group_3 = GROUP group_2 by (group.age);

These groups are computed correctly, but I run into problems when I try to access the data, for example:

data_1 = FOREACH group_1 GENERATE group.age,group.phone,group.id,COUNT(student_details.city);
data_2 = FOREACH group_2 GENERATE group.age,group.phone,COUNT(group_1.student_details.city);

The last line fails with the error: Cannot find field city in student_details:bag{:tuple(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray)}

Is this because student_details is a bag, so I need to run a FOREACH to access the tuples inside it? Is there a direct way to do this?

- Update -

Sample data:

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
009,ABC,DEF,111,9834534343,Delhi
009,ABC,DEF,111,9834534343,Delhi
009,ABC,DEF,111,9834534343,Delhi

The expected output is exactly what we would get by running the following code:

student_details = LOAD 'student_details.txt' USING PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray);
group_1 = GROUP student_details by (age,phone,id);
group_2 = GROUP student_details by (age,phone);
data_1 = FOREACH group_1 GENERATE group.age,group.phone,group.id,COUNT(student_details.city);
data_2 = FOREACH group_2 GENERATE group.age,group.phone,COUNT(student_details.city);
STORE data_1..
STORE data_2..

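But I don't want to use student_details twice, on lines 2 and 3.

This question talks about discarding tuples after grouping. I don't want to drop any tuples; I want to group again on a subset of the keys. Using FLATTEN would mean giving up the grouping performed in group_1.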
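
If only the counts are needed at each level, one possible way to build the coarser level from group_1, without grouping student_details a second time, is to aggregate once and then regroup the already-aggregated rows, summing the partial counts. The sketch below is an assumption about what is wanted, not a confirmed solution; the aliases counts_1, regrouped and counts_2 and the field name cnt are invented for illustration. FLATTEN is applied only to the group key tuple here, so the per-(age, phone, id) rows from group_1 are kept rather than discarded.

```pig
student_details = LOAD 'student_details.txt' USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray,
        age:int, phone:chararray, city:chararray);

-- Level 1: one row per (age, phone, id) with its count of non-null city values.
group_1  = GROUP student_details BY (age, phone, id);
counts_1 = FOREACH group_1 GENERATE
               FLATTEN(group) AS (age, phone, id),
               COUNT(student_details.city) AS cnt;

-- Level 2: regroup the level-1 rows (not student_details) on a subset of
-- the keys and add up the partial counts (hypothetical aliases).
regrouped = GROUP counts_1 BY (age, phone);
counts_2  = FOREACH regrouped GENERATE
                FLATTEN(group) AS (age, phone),
                SUM(counts_1.cnt) AS cnt;

DUMP counts_1;
DUMP counts_2;
```

This works for counts because the (age, phone, id) groups partition each (age, phone) group, so the partial counts add up. It does not carry the original tuples to the second level; only the aggregated values are regrouped.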

1 answer:

Answer 0: (score: 1)

For a hierarchical GROUP BY, you need the CUBE operation. The following example may solve your problem:

student_details = LOAD 'data.csv' USING PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray);
-- ROLLUP(age,phone,id) yields the levels (age,phone,id), (age,phone), (age) and the grand total; rolled-up keys are null.
cubed = CUBE student_details BY ROLLUP(age,phone,id);
result = FOREACH cubed GENERATE FLATTEN(group) as (age,phone,id),  COUNT_STAR(cube) as CNT;
-- Keep only the (age,phone,id) and (age,phone) levels.
result = FILTER result BY age is not NULL and phone is not NULL;
DUMP result;
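
The filtered result above holds both levels in a single relation. As a hedged sketch, the two levels could be separated into relations resembling data_1 and data_2 from the question with SPLIT, assuming the source data contains no null id values, so that a null id marks a row rolled up to the (age, phone) level. The aliases rolled, kept, by_age_phone_id and by_age_phone are invented for illustration.

```pig
student_details = LOAD 'data.csv' USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray,
        age:int, phone:chararray, city:chararray);

-- One pass over student_details computes every rollup level.
cubed  = CUBE student_details BY ROLLUP(age, phone, id);
rolled = FOREACH cubed GENERATE
             FLATTEN(group) AS (age, phone, id),
             COUNT_STAR(cube) AS cnt;

-- Drop the (age) level and the grand total.
kept = FILTER rolled BY age IS NOT NULL AND phone IS NOT NULL;

-- Assumption: id is never null in the input, so a null id means "rolled up".
SPLIT kept INTO by_age_phone_id IF id IS NOT NULL,
                by_age_phone    IF id IS NULL;

data_1 = FOREACH by_age_phone_id GENERATE age, phone, id, cnt;
data_2 = FOREACH by_age_phone GENERATE age, phone, cnt;

DUMP data_1;
DUMP data_2;
```

Note that COUNT_STAR counts every tuple in the bag, whereas COUNT(student_details.city) in the question skips tuples whose city is null, so the two can differ if city has missing values.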