I have a dataset that looks like this:
gr col1 col2
A 2 'haha'
A 4 'haha'
A 3 'haha'
B 5 'hoho'
B 1 'hoho'
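As you can see, within each group gr there is a numeric variable col1 and a string variable col2 whose value is the same for every row in that group.
How can I express the following pseudocode in Pig?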
foreach group gr : generate the mean of col1 and get the first occurrence of col2
So the output would look like this:
gr mean name
A 3 'haha'
B 3 'hoho'
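Thanks!
Answer 0 (score: 1)
Group by both gr and col2, then take the AVG of col1. Because col2 is constant within each gr, grouping on the pair yields the same groups as grouping on gr alone. This assumes the fields are tab-delimited.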
PigScript
A = load 'test6.txt' USING PigStorage('\t') as (gr:chararray,col1:int,col2:chararray);
B = GROUP A BY (gr,col2);
C = FOREACH B GENERATE FLATTEN(group) as (gr,name),AVG(A.col1) as mean;
DUMP C;
Note: if you need the columns in the order shown in the question (gr, mean, name), add an extra step:
D = FOREACH C GENERATE $0 as gr,$2 as mean,$1 as name;
Output
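For comparison, here is a minimal sketch that follows the question's pseudocode more literally: it groups on gr only and picks one col2 value per group with a nested LIMIT. It reuses the file name and schema from the script above; the alias one_row is made up for illustration. Pig does not guarantee which row a nested LIMIT returns unless the bag is ordered first, so this is only safe under the question's stated assumption that col2 is identical within each group.

```pig
-- Same input and schema as in the answer's script (tab-delimited fields assumed).
A = LOAD 'test6.txt' USING PigStorage('\t') AS (gr:chararray, col1:int, col2:chararray);

-- Group on gr alone, as in the pseudocode.
B = GROUP A BY gr;

-- For each group: average col1 and take col2 from a single row of the group.
C = FOREACH B {
    one_row = LIMIT A 1;  -- hypothetical alias; an arbitrary row of the group
    GENERATE group AS gr, AVG(A.col1) AS mean, FLATTEN(one_row.col2) AS name;
};

DUMP C;
```

This version emits the columns directly in the order gr, mean, name, so the extra reordering step is not needed. If col2 could differ within a group, grouping by (gr, col2) as in the answer is the safer choice, since it keeps the distinct values apart.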