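Currently my data comes in the form shown below, but I want it to display a RANK that follows the sequence of changes in the pid (Product_id) field. My script is as follows. I have tried the RANK operator, with and without DENSE, but still do not get the desired output.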
trans_c1 = LOAD '/mypath/data_file.csv' using PigStorage(',') as (date,Product_id);
(DATE,Product id)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00OZ)
(2015-01-13T18:00:40.622+05:30,B00OZ)
(2015-01-13T18:00:40.622+05:30,B00OZ)
(2015-01-13T18:00:40.622+05:30,B00VB)
(2015-01-13T18:00:40.622+05:30,B00VB)
(2015-01-13T18:00:40.622+05:30,B00VB)
(2015-01-13T18:00:40.622+05:30,B00VB)
The final output should look like this, where the rank sequence resets to 1 each time the Product_id changes. Can Pig do this?
(1,2015-01-13T18:00:40.622+05:30,B00XT)
(2,2015-01-13T18:00:40.622+05:30,B00XT)
(3,2015-01-13T18:00:40.622+05:30,B00XT)
(4,2015-01-13T18:00:40.622+05:30,B00XT)
(1,2015-01-13T18:00:40.622+05:30,B00OZ)
(2,2015-01-13T18:00:40.622+05:30,B00OZ)
(3,2015-01-13T18:00:40.622+05:30,B00OZ)
(1,2015-01-13T18:00:40.622+05:30,B00VB)
(2,2015-01-13T18:00:40.622+05:30,B00VB)
(3,2015-01-13T18:00:40.622+05:30,B00VB)
(4,2015-01-13T18:00:40.622+05:30,B00VB)
Answer 0 (score: 0)
This can be solved with the piggybank functions Stitch and Over. It can also be solved with DataFu's Enumerate function.
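For context, a short sketch of why the built-in RANK operator on its own does not give the requested result: it ranks the relation as a whole and never restarts per pid. The relation names ranked_all and ranked_by_pid are illustrative only, and the input path is the one used in the scripts in this answer.

```pig
input_data = LOAD 'data_file.csv' USING PigStorage(',') AS (date:chararray, pid:chararray);
-- Plain RANK: prepends a running number over the whole relation,
-- so the count keeps increasing when pid changes.
ranked_all = RANK input_data;
-- RANK ... BY pid DENSE: rows sharing a pid are ties,
-- so every row of a given pid receives the same rank.
ranked_by_pid = RANK input_data BY pid DENSE;
DUMP ranked_all;
DUMP ranked_by_pid;
```

Neither form numbers rows from 1 inside each pid, which is why the approaches here group by pid first and number the tuples within each group.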
Script using the Piggybank functions:
REGISTER <path to piggybank folder>/piggybank.jar;
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch;
DEFINE Over org.apache.pig.piggybank.evaluation.Over('int');
input_data = LOAD 'data_file.csv' USING PigStorage(',') AS (date:chararray, pid:chararray);
group_data = GROUP input_data BY pid;
rank_grouped_data = FOREACH group_data GENERATE FLATTEN(Stitch(input_data, Over(input_data, 'row_number')));
display_data = FOREACH rank_grouped_data GENERATE stitched::result AS rank_number, stitched::date AS date, stitched::pid AS pid;
DUMP display_data;
Script using DataFu's Enumerate function:
REGISTER <path to pig libraries>/datafu-1.2.0.jar;
DEFINE Enumerate datafu.pig.bags.Enumerate('1');
input_data = LOAD 'data_file.csv' USING PigStorage(',') AS (date:chararray, pid:chararray);
group_data = GROUP input_data BY pid;
data = FOREACH group_data GENERATE FLATTEN(Enumerate(input_data));
display_data = FOREACH data GENERATE $2, $0, $1;
DUMP display_data;
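One caveat: Pig does not guarantee the order of tuples inside a bag produced by GROUP, so the numbers within a pid may not follow any particular order. A minimal sketch, assuming the numbering should follow the date field (an assumption; the question does not say which order is wanted), sorts each bag in a nested FOREACH before enumerating it. The names sorted_bag and ordered_data are illustrative only.

```pig
REGISTER <path to pig libraries>/datafu-1.2.0.jar;
DEFINE Enumerate datafu.pig.bags.Enumerate('1');
input_data = LOAD 'data_file.csv' USING PigStorage(',') AS (date:chararray, pid:chararray);
group_data = GROUP input_data BY pid;
-- Sort each group's bag by date (assumed ordering) before numbering it.
ordered_data = FOREACH group_data {
    sorted_bag = ORDER input_data BY date;
    GENERATE FLATTEN(Enumerate(sorted_bag));
};
-- Enumerate appends the index as the last field, so move it to the front.
display_data = FOREACH ordered_data GENERATE $2, $0, $1;
DUMP display_data;
```

With the sample data every date is identical, so the sort does not change the result; it only matters when the sort field actually varies within a pid.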
The DataFu jar file can be downloaded from the Maven repository: http://search.maven.org/#search%7Cga%7C1%7Cg%3a%22com.linkedin.datafu%22
Output:
(1,2015-01-13T18:00:40.622+05:30,B00OZ)
(2,2015-01-13T18:00:40.622+05:30,B00OZ)
(3,2015-01-13T18:00:40.622+05:30,B00OZ)
(1,2015-01-13T18:00:40.622+05:30,B00VB)
(2,2015-01-13T18:00:40.622+05:30,B00VB)
(3,2015-01-13T18:00:40.622+05:30,B00VB)
(4,2015-01-13T18:00:40.622+05:30,B00VB)
(1,2015-01-13T18:00:40.622+05:30,B00XT)
(2,2015-01-13T18:00:40.622+05:30,B00XT)
(3,2015-01-13T18:00:40.622+05:30,B00XT)
(4,2015-01-13T18:00:40.622+05:30,B00XT)
Ref:
Implementing row number function in apache pig
Usage of Apache Pig rank function