Error when ranking in a Pig script

Time: 2015-02-03 10:53:16

Tags: apache-pig

I have the following data in an HDFS file:

(1,rohit1,CM,2014-12-31 13:10:23,2014-12-31 13:10:23,2015-02-02 23:23:45,9999-12-31 00:00:00)
(2,rohit2,GM,2014-12-28 14:20:23,2014-12-28 14:20:23,2015-02-02 23:23:45,9999-12-31 00:00:00)
(3,rohit3,CM,2014-12-27 17:40:53,2014-12-27 17:40:53,2015-02-02 23:23:45,9999-12-31 00:00:00)
(4,rohit4,CM,2015-01-20 16:30:26,2015-01-20 16:30:26,2015-02-02 23:23:45,9999-12-31 00:00:00)
(5,rohit5,CM,2015-01-22 14:20:25,2015-01-22 14:20:25,2015-02-02 23:23:45,9999-12-31 00:00:00)
(6,rohit6,GM,2015-01-24 14:20:34,2015-01-24 14:20:34,2015-02-02 23:23:45,9999-12-31 00:00:00)
(7,rohit7,CM,2015-01-25 11:50:58,2015-01-25 11:50:58,2015-02-02 23:23:45,9999-12-31 00:00:00)
(1,rohit1,KM,2014-12-21 13:10:23,2014-12-21 13:10:23,2015-02-01 13:23:45,9999-12-31 00:00:00)
(2,rohit9,GM,2014-12-21 14:20:23,2014-12-21 14:20:23,2015-02-01 13:23:45,9999-12-31 00:00:00)

I need to rank the records, partitioned by id and ordered by updated in descending order. To do this, I grouped the data by id, as shown below:

-- load the file data into relation A

final_data_group = group A by id;

ranking_data = RANK final_data_group by updated desc;

But it gives the following error:

2015-02-03 16:21:37,555 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
<line 10, column 46> Invalid field projection. Projected field [updated] does not exist in schema: group:int,filter_high_date_records:bag{:tuple(id:int,name:chararray,margin_value:chararray,created:chararray,updated:chararray,start_date:chararray,end_date:chararray)}.

Can someone help me solve this?

1 Answer:

Answer 0: (score: 0)

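What you are looking for is ranking within each group, which is difficult with Pig's built-in functions.

RANK cannot order by a field projected from inside a bag; it can only rank on top-level fields of the relation's tuples. After the GROUP, updated exists only inside the bag (as the schema in the error message shows), so on the grouped relation RANK works only on the group field.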
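As a rough sketch of that distinction, assuming A and final_data_group are the relations from the question (the output aliases here are made up for illustration), these are the two forms RANK should accept, because each one orders by a top-level field:

```pig
-- Assumption: A is the ungrouped relation with a top-level 'updated' field,
-- and final_data_group is the result of GROUP A BY id.

-- Ranks all records globally; 'updated' is a top-level field of A.
ranked_all = RANK A BY updated DESC;

-- Ranks the groups themselves; 'group' is the only scalar top-level field
-- of the grouped relation, so it is the only usable sort key here.
ranked_groups = RANK final_data_group BY group DESC;
```

Neither form yields a rank that restarts for each id, which is why a per-group ranking needs a different approach.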

Pig - RANK Operation on Groups

The link above explains how to use the DataFu library to RANK within groups.
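A minimal sketch of that approach, assuming the DataFu jar is available and using its Enumerate bag UDF. The jar path is a placeholder, the relation names follow the question, and this is not necessarily the exact code from the linked answer:

```pig
-- Hypothetical jar path; point this at the DataFu build you actually have.
REGISTER /path/to/datafu.jar;

-- Enumerate appends an index to every tuple in a bag; '1' starts counting at 1.
DEFINE Enumerate datafu.pig.bags.Enumerate('1');

final_data_group = GROUP A BY id;

ranking_data = FOREACH final_data_group {
    -- Sort each id's records by 'updated', newest first.
    sorted = ORDER A BY updated DESC;
    -- Number the sorted records, then flatten back to one row per record.
    GENERATE FLATTEN(Enumerate(sorted));
};
```

Each output row should then carry the original fields plus a position within its id, so the most recently updated record for an id gets position 1. Because updated is a chararray in the form yyyy-MM-dd HH:mm:ss, string ordering matches chronological ordering.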