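Say I have a relation Students with the fields grade and teacher. I want to group by grade and teacher, but keep the count of all students in the grade for each group. Something like this: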
classes = GROUP Students BY (grade,teacher);
classes = FOREACH classes {
GENERATE
(### COUNT OF ALL STUDENTS IN GRADE ###) as grade_size,
Students as students,
teacher as teacher;
}
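But I can't figure out how to filter from within the GROUP statement. Presumably some kind of filter is needed, but I don't know how to condition on a student's grade inside versus outside the group.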
Answer 0 (score: 1)
There are two ways:
1) Group by grade and teacher and count the students, then flatten and group by grade.
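-- Count students per (grade, teacher) pair.
classes = GROUP Students BY (grade,teacher);
teachers = FOREACH classes GENERATE FLATTEN(group) as (grade,teacher), COUNT(Students) as perTeacher;
-- Regroup by grade and sum the per-teacher counts to get the per-grade total.
grade = GROUP teachers BY grade;
-- FLATTEN repeats the per-grade total on every (grade, teacher) row.
result = FOREACH grade GENERATE FLATTEN(teachers), SUM(teachers.perTeacher) as perGrade;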
describe result;
dump result;
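2) Group by grade, then use the BagGroup UDF from the DataFu library to do the per-teacher grouping in memory. This is faster, but it is prone to heap out-of-memory errors.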
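The second approach might look roughly like the sketch below. It is untested and rests on several assumptions: the jar path, the input path, and the name field are hypothetical, and it assumes a DataFu release that includes datafu.pig.bags.BagGroup and that name is never null (COUNT skips tuples whose first field is null).

```pig
-- Assumption: the jar path is hypothetical; point it at your DataFu build.
REGISTER 'datafu-pig.jar';
DEFINE BagGroup datafu.pig.bags.BagGroup();

-- Assumed schema and input path, for illustration only.
Students = LOAD 'students' AS (name:chararray, grade:int, teacher:chararray);

-- One distributed GROUP per grade, so COUNT gives the grade-wide total.
by_grade = GROUP Students BY grade;

-- BagGroup regroups each grade's bag by teacher in memory and returns
-- a bag of (group, bag) tuples; FLATTEN emits one row per teacher.
classes = FOREACH by_grade GENERATE
    group AS grade,
    COUNT(Students) AS grade_size,
    FLATTEN(BagGroup(Students, Students.(teacher))) AS (teacher, students);

DESCRIBE classes;
DUMP classes;
```

Because every student in a grade is held in a single bag while BagGroup runs, a very large grade is what would trigger the heap errors mentioned above.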