How to find the average after joining two datasets and grouping, in Pig

Asked: 2015-01-04 12:43:01

Tags: apache-pig

I have two datasets: EmployeeDetail, with 4 columns (id, name, gender, location), and SalaryDetail (id, salary). I joined the two datasets and grouped the result by location.

EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);                                     
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by id;                                                                         
GroupedByLocation = group JoinedEmpDetail by location; 

DUMP GroupedByLocation gives me the result I expect. Now, when I try to compute the average with the line below,

AverageSalary = foreach GroupedByLocation generate group, AVG(SalaryDetail.salary);

it throws an error.

<line 11, column 58> Could not infer the matching function for org.apache.pig.builtin.AVG as multiple or none of them fit. Please use an explicit cast.

I also tried the following, but got the same kind of error.

AverageSalary = foreach GroupedByLocation {
  Sum = SUM(SalaryDetail.salary);
  Count = COUNT(SalaryDetail.salary);
  avgSal = Sum/Count;
  generate group as location, avgSal;
  };

This time the error is:

Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.

Can anyone suggest the right way to do this?

Thanks to Sivasakthi Jayaraman for answering my question.

AverageSalary = foreach GroupedByLocation generate group, AVG(JoinedEmpDetail.SalaryDetail::salary);

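This gives me the average salary for each location. Now I am trying to find the average salary for each gender within each location, so I tried grouping by gender inside GroupedByLocation, but I ran into a problem.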

GroupdByGender = foreach GroupedByLocation { 
genderGrp = group JoinedEmpDetail by JoinedEmpDetail.EmpDetail::gender;
avgSalary = foreach genderGrp generate group, AVG(JoinedEmpDetail.SalaryDetail::salary);
generate group as location, JoinedEmpDetail.EmpDetail::gender, avgSalary;
};

I get this error:

Syntax error, unexpected symbol at or near 'JoinedEmpDetail'

Can anyone help?

1 answer:

Answer 0: (score: 1)

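You can't access the salary column like that. After the group, the joined rows sit in a bag named JoinedEmpDetail, so you first need to project from the JoinedEmpDetail relation and then access the salary column through it.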
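A quick way to check this, as a sketch rather than part of the original answer: DESCRIBE prints the schema of a relation. Using the aliases from the question, the join should show each field prefixed with its source alias (for example SalaryDetail::salary), and the group should show those fields nested inside a bag named JoinedEmpDetail.

```pig
-- Schema after the join: fields carry their source alias as a prefix.
DESCRIBE JoinedEmpDetail;

-- Schema after the group: a 'group' key plus a bag named JoinedEmpDetail
-- that holds the joined rows for that location.
DESCRIBE GroupedByLocation;
```

There is no bag called SalaryDetail in GroupedByLocation, which is consistent with AVG and SUM failing to resolve SalaryDetail.salary.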

Can you try the statement below?

AverageSalary = foreach GroupedByLocation generate group, AVG(JoinedEmpDetail.SalaryDetail::salary);
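For the follow-up about the average salary per gender within each location: as far as I know, GROUP is not among the operators Pig accepts inside a nested foreach block, which would explain the syntax error. A minimal sketch that avoids nesting by grouping on both keys at once is shown here; the alias names GroupedByLocGender and AvgByLocGender are made up for illustration, and I have not run this against your data.

```pig
-- Group the joined relation on a composite key (location, gender).
GroupedByLocGender = group JoinedEmpDetail by (EmpDetail::location, EmpDetail::gender);

-- 'group' is now a tuple; flatten it into two columns and average the
-- salary field projected from the JoinedEmpDetail bag.
AvgByLocGender = foreach GroupedByLocGender generate
    flatten(group) as (location, gender),
    AVG(JoinedEmpDetail.SalaryDetail::salary) as avgSalary;

DUMP AvgByLocGender;
```

This should produce one row per (location, gender) pair instead of a nested result per location, which is usually easier to store or join later.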