Get the MAX from a SUM in Pig

Time: 2015-09-30 06:46:03

Tags: hadoop apache-pig bigdata

    player = LOAD 'ass2_player' USING org.apache.hive.hcatalog.pig.HCatLoader();
    player = FOREACH player GENERATE
        (chararray)$3 AS tmID,
        (int)$1 AS year,
        (int)$8 AS points;
    group_data = GROUP player BY (year, tmID);
    sum_data = FOREACH group_data GENERATE group, SUM(player.points) AS tot_points;
    max_data = FOREACH sum_data GENERATE FLATTEN(group), MAX(sum_data.tot_points);
    DUMP max_data;

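I only want to select the tmID of the team with the highest total points in each year.

How do I get the whole row, or some of its fields, for the row that has the maximum value? For example, after grouping by year, the group contains only year, and the tuples hold (tmID, tot_points). How do I get (year, tmID, tot_points) for each year?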

1 Answer:

Answer 0: (score: 0)

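You're almost there. Here is the schema of sum_data:

    ((year, tmID), tot_points)

From here you need to group by year and take the MAX of tot_points. This is easier if you flatten the group in the sum_data step, for example: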

    sum_data = FOREACH group_data GENERATE FLATTEN(group) AS (year, tmID), SUM(player.points) AS tot_points;

    sum_data_grouped = GROUP sum_data BY year;
    -- sum_data.tmID is a bag holding every tmID for the year, not only the top scorer
    max_data = FOREACH sum_data_grouped GENERATE group AS year, MAX(sum_data.tot_points) AS max_points, sum_data.tmID AS tmID;

Your final script should look like this:

    player = LOAD 'ass2_player' USING org.apache.hive.hcatalog.pig.HCatLoader();
    player = FOREACH player GENERATE (chararray)$3 AS tmID, (int)$1 AS year, (int)$8 AS points;
    -- total points per (year, team)
    group_data = GROUP player BY (year, tmID);
    sum_data = FOREACH group_data GENERATE FLATTEN(group) AS (year, tmID), SUM(player.points) AS tot_points;
    -- highest team total per year
    sum_data_grouped = GROUP sum_data BY year;
    -- sum_data.tmID is a bag holding every tmID for the year, not only the top scorer
    max_data = FOREACH sum_data_grouped GENERATE group AS year, MAX(sum_data.tot_points) AS max_points, sum_data.tmID AS tmID;
    DUMP max_data;
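
If what you want is a single (year, tmID, tot_points) row per year, one option is a nested FOREACH that sorts each year's bag and keeps only the first row. This is a minimal, untested sketch that assumes the sum_data and sum_data_grouped relations from the script above; the aliases ranked, best and top_per_year are made up for illustration:

```pig
-- Assumes sum_data has the fields (year, tmID, tot_points)
-- and sum_data_grouped = GROUP sum_data BY year;
top_per_year = FOREACH sum_data_grouped {
    -- sort this year's teams from highest to lowest total
    ranked = ORDER sum_data BY tot_points DESC;
    -- keep only the first (highest scoring) row
    best = LIMIT ranked 1;
    GENERATE FLATTEN(best) AS (year, tmID, tot_points);
};
DUMP top_per_year;
```

Note that LIMIT 1 keeps one row per year, so if two teams tie for the highest total, only one of them is returned.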

PS: I wrote this on my phone and have not tested it. Let me know what you get.