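I need to perform some numerical operations (using a UDF) on every column of a table. For each column I get two values (mean and standard deviation). But the final result comes out as a single row, (mean_1, sd_1, mean_2, sd_2, mean_3, sd_3, ...), where 1, 2, ... are the column indices. I need the output for each column on its own row instead, like: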
mean_1, sd_1 \\for col1
mean_2, sd_2 \\for col2
...
Here is the Pig script I am using:
data = LOAD 'input_file.csv' USING PigStorage(',') AS (C0,C1,C2);
grouped_data = GROUP data ALL;
res = FOREACH grouped_data GENERATE FLATTEN(data), AVG(data.$1) as mean, COUNT(data.$1) as count;
tmp = FOREACH res {
diff = (C1-mean)*(C1-mean);
GENERATE *,diff as diff;
};
grouped_diff = GROUP tmp all;
sq_tmp = FOREACH grouped_diff GENERATE flatten(tmp), SUM(tmp.diff) as sq_sum;
stat_tmp = FOREACH sq_tmp GENERATE mean as mean, sq_sum/count as variance, SQRT(sq_sum/count) as sd;
stats = LIMIT stat_tmp 1;
Can anyone guide me on how to achieve this?
Answer 0 (score: 0)
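Thanks. I got the desired output by creating one tuple of the statistics (mean, variance, and sd) for each column and then storing all of these tuples in a single bag. In the next step I flattened the bag, which turns each tuple into its own row.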
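-- raw_stats is assumed to be a single row holding mean_i, var_i, sd_i for each column i.
tupled_stats = FOREACH raw_stats GENERATE TOTUPLE(mean_0, var_0, sd_0) AS T0, TOTUPLE(mean_1, var_1, sd_1) AS T1, TOTUPLE(mean_2, var_2, sd_2) AS T2;
bagged_stats = FOREACH tupled_stats GENERATE TOBAG(T0, T1, T2) AS B;
-- FLATTEN un-nests the bag, emitting one row per tuple, i.e. one row per column.
stats = FOREACH bagged_stats GENERATE FLATTEN(B);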
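
The answer does not show how raw_stats is built. Below is a minimal sketch of one way to produce it, not necessarily what the answerer did. It assumes the three-column CSV from the question, that every column is numeric (the double types are an addition), and the population variance (sq_sum/count, as in the question's script). Instead of the question's two-pass approach it uses the identity variance = E[x^2] - mean^2, so a single GROUP ALL is enough. The relation names squares and moments and the fields C0_sq, msq_0, etc. are made up for illustration.

```pig
-- Assumption: all columns are numeric; the explicit double types are not in the original script.
data = LOAD 'input_file.csv' USING PigStorage(',') AS (C0:double, C1:double, C2:double);

-- Carry each value together with its square so one GROUP ALL yields both moments.
squares = FOREACH data GENERATE
    C0, C0 * C0 AS C0_sq,
    C1, C1 * C1 AS C1_sq,
    C2, C2 * C2 AS C2_sq;

grouped = GROUP squares ALL;

-- One row holding the mean and the mean of squares for every column.
moments = FOREACH grouped GENERATE
    AVG(squares.C0) AS mean_0, AVG(squares.C0_sq) AS msq_0,
    AVG(squares.C1) AS mean_1, AVG(squares.C1_sq) AS msq_1,
    AVG(squares.C2) AS mean_2, AVG(squares.C2_sq) AS msq_2;

-- Population variance via E[x^2] - mean^2, then its square root.
raw_stats = FOREACH moments GENERATE
    mean_0, (msq_0 - mean_0 * mean_0) AS var_0, SQRT(msq_0 - mean_0 * mean_0) AS sd_0,
    mean_1, (msq_1 - mean_1 * mean_1) AS var_1, SQRT(msq_1 - mean_1 * mean_1) AS sd_1,
    mean_2, (msq_2 - mean_2 * mean_2) AS var_2, SQRT(msq_2 - mean_2 * mean_2) AS sd_2;
```

Feeding this raw_stats into the TOTUPLE, TOBAG, and FLATTEN statements above should give three rows of (mean, variance, sd), one per column. Note that the E[x^2] - mean^2 form can lose precision when the mean is large relative to the spread, so the two-pass computation from the question may be preferable for such data.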