My data format is student_id, course_id, grade, other_information, and there are lots of students — on the order of billions of rows. I have a Perl script that processes one student's data at a time, so I'm considering using the Hadoop framework to speed things up by streaming each student's data through the Perl script.

Here is what I have so far:
student_data = LOAD 'source' using PigStorage('\t') As (stud_id:string,...)
grp_student = group student_data by stud_id;
final_data = foreach grp_student {
flat_data = flatten(grp_student)
each_stud_data = generate flat_data;
result = STREAM each_stud_data THROUGH 'some perl script';
}
store final_data into '/some_location';
Problem: I get this error: Syntax error, unexpected symbol at or near 'flatten'. I tried googling, but in vain. Can anyone help?
Answer 0 (score: 1)
A few hints: FLATTEN is not allowed inside a nested FOREACH, and GENERATE must be the last statement in the nested block.

Regarding the STREAM command, the Pig docs say:
About Data Guarantees
Data guarantees are determined based on the position of the streaming operator in the Pig script.
[...]
Grouped data – The data for the same grouped key is guaranteed to be provided to the streaming application contiguously
[...]
So if you adapt your script so that it can cope with receiving all the data for a grouped key contiguously, it should work:
student_data = LOAD 'source' using PigStorage('\t') As (stud_id:string,...);
grp_student = GROUP student_data BY stud_id;
flat_data = FOREACH grp_student GENERATE FLATTEN(student_data);
result = STREAM flat_data THROUGH 'some perl script';
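To illustrate what "contiguously" means for the streaming side, here is a minimal sketch (in Python rather than Perl, and with a made-up per-student computation — averaging the grade column — standing in for whatever your actual script does). Because Pig guarantees that all rows for the same grouped key arrive adjacent to one another on stdin, the script can group by simply watching for the key to change, without buffering the whole input:

```python
import sys
from itertools import groupby

# Hypothetical stand-in for 'some perl script': reads tab-separated rows
# (stud_id, course_id, grade, ...) from stdin, sorted so that each
# student's rows are contiguous, and emits one output line per student.

def process_student(stud_id, rows):
    # Placeholder per-student logic: average the grade column.
    grades = [float(r[2]) for r in rows]
    return f"{stud_id}\t{sum(grades) / len(grades):.2f}"

def stream(lines):
    rows = (line.rstrip("\n").split("\t") for line in lines)
    # groupby only merges *adjacent* rows with the same key, which is
    # exactly what Pig's contiguity guarantee for grouped data makes safe.
    for stud_id, group in groupby(rows, key=lambda r: r[0]):
        yield process_student(stud_id, list(group))

if __name__ == "__main__":
    for out in stream(sys.stdin):
        print(out)
```

The key point is that no dictionary keyed by stud_id is needed: adjacent-only grouping keeps memory usage bounded by one student's data, which matters at billions of rows.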