Advice on simplifying my Pig code below

Time: 2016-02-25 20:51:30

Tags: hadoop apache-pig

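Here is my code. It uses two GROUP ALL operations, and it works. My goal is to produce the number of unique users across all students together with their total scores, plus the number of unique users among students located in CA. Is there a good way to make the code use only one group operation, or any other constructive idea for simplifying it, for example using only one FOREACH operation? Thanks.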

student_all = group student all;
student_all_summary = FOREACH student_all GENERATE COUNT_STAR(student) as uu_count, SUM(student.mathScore) as count1,SUM(student.verbScore) as count2;

student_CA = filter student by LID==1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA);

Sample input (student ID, location ID, mathScore, verbScore):

1 1 10  20
2 1 20  30
3 1 30  40
4 2 30  50
5 2 30  50
6 3 30  50

Sample output (unique users, unique users in CA, sum of mathScore over all students, sum of verbScore over all students):

6 3 150 240

Thanks in advance,
Lin

3 answers:

Answer 0: (score: 1)

You are probably looking for this.

data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);

gdata = group data all;

result = foreach gdata {
        student_CA = filter data by lid == 1; 
        student_CA_sum = SUM( student_CA.sid ) ;
        student_CA_count = COUNT( student_CA.sid ) ;
        mathScore = SUM(data.ms);
        verbScore = SUM(data.vs);
        GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore  as mathScore, verbScore as verbScore;
 };

The output is:

grunt> dump result
    (6,3,150,240)
grunt> describe result
    result: {student_CA_sum: long,student_CA_count: long,mathScore: long,verbScore: long}

Answer 1: (score: 1)

First load the file (student) into the Hadoop file system, then do the following.

split student into student_CA if locationId == 1, student_Other if locationId != 1;

student_CA_all = group student_CA all;

student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA) as uu_count,COUNT_STAR(student_CA)as locationCACount, SUM(student_CA.mathScore) as mScoreCount,SUM(student_CA.verbScore) as vScoreCount;

student_Other_all = group student_Other all;

student_Other_all_summary = FOREACH student_Other_all GENERATE COUNT_STAR(student_Other) as uu_count,0 as locationOtherCount:long, SUM(student_Other.mathScore) as mScoreCount,SUM(student_Other.verbScore) as vScoreCount;

student_CAandOther_all_summary = UNION student_CA_all_summary, student_Other_all_summary;

student_summary_all = group student_CAandOther_all_summary all;

student_summary = foreach student_summary_all generate SUM(student_CAandOther_all_summary.uu_count) as studentIdCount, SUM(student_CAandOther_all_summary.locationCACount) as locationCount, SUM(student_CAandOther_all_summary.mScoreCount) as mathScoreCount , SUM(student_CAandOther_all_summary.vScoreCount) as verbScoreCount;

Output:

dump student_summary;
(6,3,150,240)

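Hope this helps :)

While working on your problem I also ran into an issue with Pig. I think it is caused by improper exception handling in the UNION command. In fact, running that command may hang the command-line prompt without a proper error message. If you like, I can share the code snippet with you.

Answer 2: (score: 1)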

The accepted answer has a logic error: student_CA_sum is computed as SUM(student_CA.sid), which adds up the IDs of the CA students instead of counting all students. With the original sample input the two values happen to coincide (1 + 2 + 3 = 6).

Try the following input file:

1 1 10  20
2 1 20  30
3 1 30  40
4 2 30  50
5 2 30  50
6 3 30  50
7 1 10  10

Output of the accepted answer:

(13,4,160,250)

The output should be:

(7,4,160,250)

I have modified the script to make it work:

data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);

gdata = group data all;

result = foreach gdata {
    student_CA_sum = COUNT( data.sid ) ;
    student_CA = filter data by lid == 1;
    student_CA_count = COUNT( student_CA.sid ) ;
    mathScore = SUM(data.ms);
    verbScore = SUM(data.vs);
    GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore  as mathScore, verbScore as verbScore;
};
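
Since the question also asks about using a single plain FOREACH, here is a sketch of a flat variant of the same single-group approach. It flags the CA rows before grouping, so no nested block is needed. It reuses the path and schema from the scripts above and assumes the fields are separated by exactly one space; the names flagged, is_ca and student_count are only illustrative.

```pig
-- Assumption: single-space-delimited file, same path and schema as the scripts above.
data = LOAD '/tmp/temp.csv' USING PigStorage(' ') AS (sid:int, lid:int, ms:int, vs:int);

-- Mark CA rows (lid == 1) with 1 and all other rows with 0.
flagged = FOREACH data GENERATE (lid == 1 ? 1 : 0) AS is_ca, ms, vs;

-- A single GROUP ALL collects every row into one bag.
grouped = GROUP flagged ALL;

-- One flat FOREACH: total rows, CA rows, and the two score totals.
result = FOREACH grouped GENERATE
    COUNT_STAR(flagged) AS student_count,
    SUM(flagged.is_ca)  AS student_CA_count,
    SUM(flagged.ms)     AS mathScore,
    SUM(flagged.vs)     AS verbScore;

DUMP result;
```

For the seven-row input above this should give (7,4,160,250). Like the original question it counts rows with COUNT_STAR, so it only reports unique users if each student appears on exactly one row.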