I have data in the following three files in HDFS:
EmployeeManagers.txt(EmpID,ManagerID)
1,5
2,4
3,4
4,6
5,6
EmployeeNames.txt(EmpID,Name)
1,Ronald Rays
2,Jimmy Kent
3,Shannon Witt
4,Krinton Kale
5,Harry Donal
6,Christina Fernandez
EmployeeSalary.txt(EmpID,Salary)
1,1000
2,2000
3,3000
4,4000
5,5000
6,6000
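I want to create RDDs from these files and print the data in the format ID, employee name, salary, manager name.
I have joined the three RDDs on their key (the first column of each text file), and I can print the manager ID but not the manager name.
Here is the code I wrote: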
// Each file becomes a pair RDD keyed by EmpID (the first column)
val manager = sc.textFile("EmployeeManagers")
val managerPairRDD = manager.map(x => (x.split(",")(0), x.split(",")(1)))
val name = sc.textFile("EmployeeNames")
val namePairRDD = name.map(x => (x.split(",")(0), x.split(",")(1)))
val salary = sc.textFile("EmployeeSalary")
val salaryPairRDD = salary.map(x => (x.split(",")(0), x.split(",")(1)))
// Join on EmpID: (id, ((name, salary), managerId))
val data = namePairRDD.join(salaryPairRDD).join(managerPairRDD)
The current output looks like this:
scala> data.collect();
res4: Array[(String, ((String, String), String))] = Array((4,((Krinton Kale,4000),6)), (5,((Harry Donal,5000),6)), (2,((Jimmy Kent,2000),4)), (3,((Shannon Witt,3000),4)), (1,((Ronald Rays,1000),5)))
Answer 0 (score: 3)
Well, you have to join with namePairRDD once more, this time using the manager ID as the key:
val result = namePairRDD
  .join(salaryPairRDD)
  .join(managerPairRDD)
  .map { case (id, ((name, salary), mngrId)) => (mngrId, (id, name, salary)) }
  .join(namePairRDD) // join again, this time on managerId
  .map { case (_, ((id, name, salary), mngrName)) => (id, name, salary, mngrName) }
result.foreach(println)
// Row order may vary between runs. Salary is still a String here, so it prints without a decimal part.
// (2,Jimmy Kent,2000,Krinton Kale)
// (3,Shannon Witt,3000,Krinton Kale)
// (1,Ronald Rays,1000,Harry Donal)
// (4,Krinton Kale,4000,Christina Fernandez)
// (5,Harry Donal,5000,Christina Fernandez)
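Note that employee 6 (Christina Fernandez) does not appear in the result above: she has no row in EmployeeManagers.txt, and an inner join drops keys that are missing on either side. If she should be listed as well, one possible variation is to swap the two later joins for leftOuterJoin. The following is only a sketch under a few assumptions that are not in the original post: the data is inlined with sc.parallelize instead of being read from HDFS, the "N/A" placeholder and the empty-string key for "no manager" are arbitrary illustrative choices, and the context is created with getOrCreate so the snippet can be pasted into spark-shell (where it reuses the existing context).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the running context in spark-shell; otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("employee-join-sketch").setMaster("local[*]"))

// Inline copies of the three files, keyed by EmpID (stand-ins for the HDFS reads).
val namePairRDD = sc.parallelize(Seq(
  "1" -> "Ronald Rays", "2" -> "Jimmy Kent", "3" -> "Shannon Witt",
  "4" -> "Krinton Kale", "5" -> "Harry Donal", "6" -> "Christina Fernandez"))
val salaryPairRDD = sc.parallelize(Seq(
  "1" -> "1000", "2" -> "2000", "3" -> "3000",
  "4" -> "4000", "5" -> "5000", "6" -> "6000"))
val managerPairRDD = sc.parallelize(Seq(
  "1" -> "5", "2" -> "4", "3" -> "4", "4" -> "6", "5" -> "6"))

val rows = namePairRDD
  .join(salaryPairRDD)
  // Keep employees without a manager row: mngrId becomes an Option[String].
  .leftOuterJoin(managerPairRDD)
  // Re-key by manager ID; "" is a hypothetical key that matches no employee.
  .map { case (id, ((name, salary), mngrId)) => (mngrId.getOrElse(""), (id, name, salary)) }
  // Look up the manager's name; None when there is no manager.
  .leftOuterJoin(namePairRDD)
  .map { case (_, ((id, name, salary), mngrName)) => (id, name, salary, mngrName.getOrElse("N/A")) }
  .sortBy(_._1) // sort by employee ID so the printed order is stable
  .collect()

rows.foreach(println)
// (1,Ronald Rays,1000,Harry Donal)
// (2,Jimmy Kent,2000,Krinton Kale)
// (3,Shannon Witt,3000,Krinton Kale)
// (4,Krinton Kale,4000,Christina Fernandez)
// (5,Harry Donal,5000,Christina Fernandez)
// (6,Christina Fernandez,6000,N/A)
```

Whether the top-level manager belongs in the output depends on what the report is for; if only employees who have a manager are wanted, the inner joins in the answer above are sufficient.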