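I have a dataset with a large number of fields and rows. I want to perform hierarchical grouping, but I can't figure out how to access the fields inside the grouped dataset.
For example, suppose we have (id, firstname, lastname, age, phone, city).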
student_details = LOAD 'student_details.txt' USING PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray);
group_1 = GROUP student_details by (age,phone,id);
group_2 = GROUP group_1 by (group.age,group.phone);
group_3 = GROUP group_2 by (group.age);
The groups are computed correctly, but I run into problems when I try to access the data, for example:
data_1 = FOREACH group_1 GENERATE group.age,group.phone,group.id,COUNT(student_details.city);
data_2 = FOREACH group_2 GENERATE group.age,group.phone,COUNT(group_1.student_details.city);
The last line results in an error:
Cannot find field city in student_details:bag{:tuple(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray)}
Is this because student_details is a bag, so I need to run a FOREACH to access the tuples inside it? Is there a direct way to do this?
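One way to check how deeply the data is nested is to inspect the schema with DESCRIBE. The sketch below reuses the aliases from the script above and only assumes that the input file is in place. After the second GROUP, group_1 should show up as a bag whose tuples each hold a group key plus a nested student_details bag, so group_1.student_details is a bag of bags rather than a bag of tuples with a city field.

```pig
-- Load the data and build the same two grouping levels as above.
student_details = LOAD 'student_details.txt' USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray,
        age:int, phone:chararray, city:chararray);
group_1 = GROUP student_details BY (age, phone, id);
group_2 = GROUP group_1 BY (group.age, group.phone);

-- Print the schemas; compare how student_details is nested in each relation.
DESCRIBE group_1;
DESCRIBE group_2;
```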
- Update -
Sample data:
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
009,ABC,DEF,111,9834534343,Delhi
009,ABC,DEF,111,9834534343,Delhi
009,ABC,DEF,111,9834534343,Delhi
The expected output is exactly what the following code produces:
student_details = LOAD 'student_details.txt' USING PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray);
group_1 = GROUP student_details by (age,phone,id);
group_2 = GROUP student_details by (age,phone);
data_1 = FOREACH group_1 GENERATE group.age,group.phone,group.id,COUNT(student_details.city);
data_2 = FOREACH group_2 GENERATE group.age,group.phone,COUNT(student_details.city);
STORE data_1..
STORE data_2..
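But I don't want to use student_details twice, as on lines 2 and 3 of that script.
This question talks about discarding tuples after grouping. I don't want to drop any tuples; I want to do another group on a subset of the keys. Using FLATTEN means I give up the grouping performed in group_1.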
Answer 0: (score: 1)
For a hierarchical GROUP BY, you need the CUBE operation. The following example may solve your problem:
student_details = LOAD 'data.csv' USING PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray);
cubed = CUBE student_details BY ROLLUP(age,phone,id);
result = FOREACH cubed GENERATE FLATTEN(group) as (age,phone,id), COUNT_STAR(cube) as CNT;
result = FILTER result BY age is not NULL and phone is not NULL;
DUMP result;
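In the result above, rows where id is null are the (age, phone) subtotals produced by ROLLUP, and the remaining rows are the (age, phone, id) counts.

If CUBE is not available, an alternative is to aggregate in two stages: count at the finest level first, then group those counts on the coarser key and sum them. This is a minimal sketch rather than a definitive solution. The alias cnt is introduced here for illustration, and the approach only works for aggregates that can be combined from partial results, such as counts and sums.

```pig
-- Stage 0: load the raw rows once.
student_details = LOAD 'student_details.txt' USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray,
        age:int, phone:chararray, city:chararray);

-- Stage 1: count per (age, phone, id). FLATTEN turns the group tuple
-- back into top-level fields so they can be used as keys again.
group_1 = GROUP student_details BY (age, phone, id);
data_1  = FOREACH group_1 GENERATE
              FLATTEN(group) AS (age, phone, id),
              COUNT(student_details.city) AS cnt;

-- Stage 2: group the stage-1 counts by the coarser key and add them up.
-- student_details itself is not grouped a second time.
group_2 = GROUP data_1 BY (age, phone);
data_2  = FOREACH group_2 GENERATE
              FLATTEN(group) AS (age, phone),
              SUM(data_1.cnt) AS cnt;

DUMP data_1;
DUMP data_2;
```

With the sample data from the question, the three identical rows for id 9 should give a count of 3 at both levels. Note that COUNT skips tuples whose city is null, while COUNT_STAR in the CUBE version counts every row, so the two approaches can differ when city has missing values.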