Calculating an average with a Pig script

Time: 2016-07-13 16:47:59

Tags: apache-pig

/*calculating average for itemssold(int) grouped by city */

a = LOAD 'sales.txt' USING PigStorage(','); /*loading sales data and it has 50 fields that are comma separated*/ 
b = FOREACH a GENERATE $3 as city:chararray, $4 as itemssold:int;/*defining schema for needed fields*/
c = GROUP b BY city; /*grouping by city*/
d = FOREACH c GENERATE group,AVG(b.itemssold); /*calculating average*/
dump d; /*writing output*/

Here I am trying to calculate the average of itemssold, grouped by city.

  

Error: the script fails while calculating the average.

Can someone help me resolve this error?

Note: sales.txt has 50 comma-separated fields, so I do not want to define a schema for every field when loading sales.txt into the relation.

2 answers:

Answer 0: (score: 0)

Your data may contain some missing values; try filtering them out first:

no_nulls = FILTER b BY itemssold is not null;
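
For context, here is a minimal sketch of where that filter could sit in the full pipeline. It assumes the same layout as the question (city in $3, itemssold in $4) and uses explicit casts, so a value that cannot be parsed as an int becomes null and is dropped by the filter. The alias avg_itemssold is a name chosen here for illustration.

```pig
-- Sketch only: assumes sales.txt is comma separated, city in $3, itemssold in $4.
a = LOAD 'sales.txt' USING PigStorage(',');

-- Explicit casts; a value that cannot be parsed as an int becomes null.
b = FOREACH a GENERATE (chararray) $3 AS city, (int) $4 AS itemssold;

-- Drop rows whose itemssold is missing or unparseable.
no_nulls = FILTER b BY itemssold IS NOT NULL;

-- Group the filtered relation and average within each city.
c = GROUP no_nulls BY city;
d = FOREACH c GENERATE group AS city, AVG(no_nulls.itemssold) AS avg_itemssold;

DUMP d;
```

Note that after grouping the filtered relation, the bag inside each group is named no_nulls, so the average is taken over no_nulls.itemssold rather than b.itemssold.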

Answer 1: (score: 0)

a = LOAD 'sales.txt' USING PigStorage(',');
b = FOREACH a GENERATE (chararray) $3 as city, (int) $4 as itemssold;
c = GROUP b BY city; 
d = FOREACH c GENERATE group,AVG(b.itemssold); 
dump d; 

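The code above works. In the GENERATE for relation b, I was declaring a schema instead of casting, so Pig got confused. It is corrected now and works. Thanks, everyone, for the suggestions.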
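
A possible alternative, offered as a sketch rather than a tested fix: declare a schema in the LOAD statement for only the leading fields up to itemssold. To my understanding, when a LOAD schema names fewer fields than the data contains, Pig keeps the declared fields and discards the rest, which would avoid describing all 50 columns. That behaviour is worth confirming on your Pig version. The names f0, f1 and f2 are placeholders for columns the question does not describe.

```pig
-- Sketch only: f0, f1, f2 are placeholder names for the first three columns.
-- Untyped fields default to bytearray; only city and itemssold get explicit types.
a = LOAD 'sales.txt' USING PigStorage(',')
    AS (f0, f1, f2, city:chararray, itemssold:int);

-- Keep just the two fields needed for the average.
b = FOREACH a GENERATE city, itemssold;

c = GROUP b BY city;
d = FOREACH c GENERATE group AS city, AVG(b.itemssold) AS avg_itemssold;

DUMP d;
```

The explicit-cast version in this answer stays the more direct choice when the needed fields are not among the first few columns.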