I have two data sets:
EmployeeDetail(data set 1):-
id
name
gender
location
SalaryDetail(data set 2):-
id
salary
I need to join the two and find the average salary of males and females in each location, so I tried the following code.
EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by id;
GroupedByLocation = group JoinedEmpDetail by location;
AverageSalary = foreach GroupedByLocation {
    genderGrp = group JoinedEmpDetail by JoinedEmpDetail.EmpDetail::gender;
    avgSalary = foreach genderGrp generate group, AVG(JoinedEmpDetail.SalaryDetail::salary);
    generate group as location, JoinedEmpDetail.EmpDetail::gender, avgSalary;
};
But it fails with the following error:
<line 6, column 22> Syntax error, unexpected symbol at or near 'JoinedEmpDetail'
Can anyone tell me where I am going wrong, or how to do this correctly?
To make my requirement clearer, here are some sample data sets.
EmpDetail.txt
1 Biswa Male Bangalore
12 Bratati Mahapatra Female Chennai
2 Bibhu kalyan Male Bangalore
3 Chinta Male Mumbai
10 Amrit Anand Male Bangalore
11 Sateesh panda Male Bangalore
4 Kirti Kumar Male Mumbai
6 Shruthi Female Chennai
7 Vijay Male Chennai
5 Bibhu Male Chennai
9 Bratati Mohanty Female Bangalore
8 Rupa Mahapatra Female Bangalore
13 Salini Female Mumbai
14 Priyanka Chopra Female Mumbai
EmpSalary.txt
1 10000
12 12000
2 15900
3 9000
10 8000
11 13400
4 7600
6 22000
7 17000
5 16800
9 9800
8 10000
13 11000
14 12500
The final result I need is:
Mumbai male <avgsalary amount>
Mumbai female <avgsalary amount>
Bangalore male <avgsalary amount>
Bangalore female <avgsalary amount>
Chennai male <avgsalary amount>
Chennai female <avgsalary amount>
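Answer 0 (score: 1)
You can solve this with a simple foreach statement, so there is no need for a nested foreach.
The GROUP command does not work inside a nested foreach; this is a restriction in Pig. Only a few operators are allowed inside a nested foreach (CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY).
Instead, group by both location and gender in one step. Can you change your script as follows?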
EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by id;
GroupedByLocation = group JoinedEmpDetail by (location,gender);
AverageSalary = FOREACH GroupedByLocation GENERATE FLATTEN(group),AVG(JoinedEmpDetail.SalaryDetail::salary);
DUMP AverageSalary;
Output:
(Mumbai,Male,8300.0)
(Mumbai,Female,11750.0)
(Chennai,Male,16900.0)
(Chennai,Female,17000.0)
(Bangalore,Male,11825.0)
(Bangalore,Female,9900.0)
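
As a sanity check on those figures, the same join, group, and average can be reproduced outside Pig. The following is only a minimal Python sketch, not part of the Pig solution: the sample rows are copied from EmpDetail.txt and EmpSalary.txt above (names omitted), and the function name `average_salary_by_location_and_gender` is made up for illustration.

```python
from collections import defaultdict

# (id, gender, location) rows copied from the sample EmpDetail.txt; names omitted.
employees = [
    (1, "Male", "Bangalore"), (12, "Female", "Chennai"), (2, "Male", "Bangalore"),
    (3, "Male", "Mumbai"), (10, "Male", "Bangalore"), (11, "Male", "Bangalore"),
    (4, "Male", "Mumbai"), (6, "Female", "Chennai"), (7, "Male", "Chennai"),
    (5, "Male", "Chennai"), (9, "Female", "Bangalore"), (8, "Female", "Bangalore"),
    (13, "Female", "Mumbai"), (14, "Female", "Mumbai"),
]

# id -> salary, copied from the sample EmpSalary.txt.
salaries = {
    1: 10000, 12: 12000, 2: 15900, 3: 9000, 10: 8000, 11: 13400, 4: 7600,
    6: 22000, 7: 17000, 5: 16800, 9: 9800, 8: 10000, 13: 11000, 14: 12500,
}


def average_salary_by_location_and_gender(employees, salaries):
    """Inner-join on id, group by (location, gender), and average the salaries."""
    groups = defaultdict(list)
    for emp_id, gender, location in employees:
        if emp_id in salaries:  # keep only ids present in both data sets
            groups[(location, gender)].append(salaries[emp_id])
    return {key: sum(values) / len(values) for key, values in groups.items()}


averages = average_salary_by_location_and_gender(employees, salaries)
for (location, gender), avg in sorted(averages.items()):
    print(location, gender, avg)
```

Grouping on the composite key (location, gender) mirrors what `group JoinedEmpDetail by (location,gender)` does in the script, which is why no nested grouping is needed.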