Error when using MAX in Apache Pig (Hadoop)

Time: 2015-02-08 23:48:07

Tags: hadoop apache-pig

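I am trying to compute the maximum value for each group of a relation in Pig. The relation has three columns, patientid, featureid and featurevalue (all int). I group the relation by featureid and want to compute the maximum featurevalue for each group. Here is the code: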

grpd = GROUP features BY featureid;
DUMP grpd;
temp = FOREACH grpd GENERATE $0 as featureid, MAX($1.featurevalue) as val;

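It gives me an "Invalid scalar projection: grpd" exception. I read on several forums that functions like MAX take a "bag" as input, but when I dump grpd, the output does show bags. Here is a small part of the dump output: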

(5662,{(22579,5662,1)})
(5663,{(28331,5663,1),(2624,5663,1)})
(5664,{(27591,5664,1)})
(5665,{(30217,5665,1),(31526,5665,1)})
(5666,{(27783,5666,1),(30983,5666,1),(32424,5666,1),(28064,5666,1),(28932,5666,1)})
(5667,{(31257,5667,1),(27281,5667,1)})
(5669,{(31041,5669,1)})

What is the problem?

1 Answer:

Answer 0: (score: 0)

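The problem is in how the columns are addressed. After a GROUP, the grouping key is named group and the bag takes the name of the grouped relation (features here), so refer to the fields by those names. Here is the correct, working code: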

grpd = GROUP features BY featureid;
temp = FOREACH grpd GENERATE group as featureid, MAX(features.featurevalue) as val;
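
To check which field names are available after the GROUP, it can help to print the schema with DESCRIBE before writing the FOREACH. The following is only a minimal sketch, not the asker's actual script: the input file name 'features.csv' and the comma delimiter are assumptions made for illustration, while the column names and types come from the question.

```pig
-- Assumption: 'features.csv' is a hypothetical comma-separated input file;
-- the original question does not show how `features` was loaded.
features = LOAD 'features.csv' USING PigStorage(',')
           AS (patientid:int, featureid:int, featurevalue:int);

grpd = GROUP features BY featureid;

-- Print the schema of the grouped relation. It should list two fields:
-- `group` (the grouping key) and `features` (a bag of the original tuples).
DESCRIBE grpd;

-- Project featurevalue out of the bag and take the maximum per group.
temp = FOREACH grpd GENERATE group AS featureid, MAX(features.featurevalue) AS val;
DUMP temp;
```

The point of the DESCRIBE step is that the names it reports (group and features) are the ones to use inside the FOREACH; the alias grpd names the whole relation, not a field within it.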