I have the following data in an HDFS file:
(1,rohit1,CM,2014-12-31 13:10:23,2014-12-31 13:10:23,2015-02-02 23:23:45,9999-12-31 00:00:00)
(2,rohit2,GM,2014-12-28 14:20:23,2014-12-28 14:20:23,2015-02-02 23:23:45,9999-12-31 00:00:00)
(3,rohit3,CM,2014-12-27 17:40:53,2014-12-27 17:40:53,2015-02-02 23:23:45,9999-12-31 00:00:00)
(4,rohit4,CM,2015-01-20 16:30:26,2015-01-20 16:30:26,2015-02-02 23:23:45,9999-12-31 00:00:00)
(5,rohit5,CM,2015-01-22 14:20:25,2015-01-22 14:20:25,2015-02-02 23:23:45,9999-12-31 00:00:00)
(6,rohit6,GM,2015-01-24 14:20:34,2015-01-24 14:20:34,2015-02-02 23:23:45,9999-12-31 00:00:00)
(7,rohit7,CM,2015-01-25 11:50:58,2015-01-25 11:50:58,2015-02-02 23:23:45,9999-12-31 00:00:00)
(1,rohit1,KM,2014-12-21 13:10:23,2014-12-21 13:10:23,2015-02-01 13:23:45,9999-12-31 00:00:00)
(2,rohit9,GM,2014-12-21 14:20:23,2014-12-21 14:20:23,2015-02-01 13:23:45,9999-12-31 00:00:00)
I need to rank the records, partitioned by id and ordered by updated in descending order. To do that, I first grouped the data by id, as shown below:
-- A holds the data loaded from the HDFS file
final_data_group = group A by id;
ranking_data = RANK final_data_group by updated desc;
But it gives the following error:
2015-02-03 16:21:37,555 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
<line 10, column 46> Invalid field projection. Projected field [updated] does not exist in schema: group:int,filter_high_date_records:bag{:tuple(id:int,name:chararray,margin_value:chararray,created:chararray,updated:chararray,start_date:chararray,end_date:chararray)}.
Can anyone help me solve this?
Answer 0 (score: 0)
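What you are looking for is a rank within each group, which is hard to do with Pig's built-in functions.
RANK does not support ranking on a field projected from inside a bag; you can only rank on top-level fields of the tuple. After the GROUP, the only top-level fields are group and the bag, so here RANK only works on the group field.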
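To see why the projection fails, it may help to compare what RANK can and cannot reach after grouping. This is a minimal sketch, assuming the flat relation is named A as in the question and carries the fields listed in the error message:

```pig
-- Assumption: A is the flat relation from the question, with fields
-- (id, name, margin_value, created, updated, start_date, end_date).
final_data_group = GROUP A BY id;
DESCRIBE final_data_group;   -- top-level fields are only `group` and the bag `A`

-- RANK accepts top-level fields, so both of these are valid statements:
ranked_groups = RANK final_data_group BY group;
ranked_flat   = RANK A BY id ASC, updated DESC;   -- one global ranking, not one per id
```

Neither statement restarts the rank for each id, which is why a per-group rank needs a different approach.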
Pig - RANK Operation on Groups
The link above explains how to use the DataFu library to RANK within groups.
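The linked answer is not reproduced here, but one common DataFu pattern for a per-group rank is to sort each bag in a nested FOREACH and number its tuples with the Enumerate UDF. The sketch below assumes the flat relation is named A as in the question; the jar path is a placeholder, not something given in the original post:

```pig
-- Assumption: the jar path is a placeholder; point it at your DataFu jar.
REGISTER /path/to/datafu.jar;
DEFINE Enumerate datafu.pig.bags.Enumerate('1');  -- start numbering at 1

-- Assumption: A is the flat relation from the question.
final_data_group = GROUP A BY id;
ranking_data = FOREACH final_data_group {
    sorted = ORDER A BY updated DESC;      -- newest update first within each id
    GENERATE FLATTEN(Enumerate(sorted));   -- appends the position as the last field
};
```

With the sample data, id 1 and id 2 each have two records, so their positions would run 1 and 2, while every other id gets a single record at position 1. Because updated is a chararray in yyyy-MM-dd HH:mm:ss form, sorting it as a string matches chronological order.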