Get the value of the second field of an existing RDD based on the first field's value from another RDD

Asked: 2017-01-25 12:39:10

Tags: scala apache-spark bigdata

I have data in the following three files in HDFS:

EmployeeManagers.txt (EmpID,ManagerID)

1,5
2,4
3,4
4,6
5,6

EmployeeNames.txt (EmpID,Name)

1,Ronald Rays
2,Jimmy Kent
3,Shannon Witt
4,Krinton Kale
5,Harry Donal
6,Christina Fernandez

EmployeeSalary.txt (EmpID,Salary)

1,1000
2,2000
3,3000
4,4000
5,5000
6,6000

I want to create an RDD from these files and print the data in the format: EmpID, employee name, salary, manager name.

I have joined the three RDDs on the key (i.e., the first column of each text file), and I can print the manager's ID but not the manager's name.

Here is the code I wrote:

// Build (key, value) pair RDDs keyed on EmpID (the first column)
val manager = sc.textFile("EmployeeManagers")
val managerPairRDD = manager.map(_.split(",")).map(a => (a(0), a(1)))
val name = sc.textFile("EmployeeNames")
val namePairRDD = name.map(_.split(",")).map(a => (a(0), a(1)))
val salary = sc.textFile("EmployeeSalary")
val salaryPairRDD = salary.map(_.split(",")).map(a => (a(0), a(1)))
// Join all three on EmpID
val data = namePairRDD.join(salaryPairRDD).join(managerPairRDD)

The current output looks like this. Each record has the shape (EmpID, ((Name, Salary), ManagerID)), so the manager's ID is there, but not the name:

scala> data.collect();
res4: Array[(String, ((String, String), String))] = Array((4,((Krinton Kale,4000),6)), (5,((Harry Donal,5000),6)), (2,((Jimmy Kent,2000),4)), (3,((Shannon Witt,3000),4)), (1,((Ronald Rays,1000),5)))
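
Flattening data as far as I can (a sketch based on the code above) still leaves the manager's ID, not the name, in the last field:

val flattened = data.map { case (id, ((name, salary), mngrId)) =>
  // last field is still the manager's ID; I need the manager's name here
  (id, name, salary, mngrId)
}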

1 Answer:

Answer 0 (score: 3):

Well, you have to join with namePairRDD again, this time with the manager ID as the key:

val result = namePairRDD
  .join(salaryPairRDD)
  .join(managerPairRDD)
  .map { case (id, ((name, salary), mngrId)) => (mngrId, (id, name, salary)) }
  .join(namePairRDD) // join again, this time on managerId
  .map { case (_, ((id, name, salary), mngrName)) => (id, name, salary, mngrName) }

result.foreach(println)
// (2,Jimmy Kent,2000,Krinton Kale)
// (3,Shannon Witt,3000,Krinton Kale)
// (1,Ronald Rays,1000,Harry Donal)
// (4,Krinton Kale,4000,Christina Fernandez)
// (5,Harry Donal,5000,Christina Fernandez)
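
As a side note, since the names table is tiny, the second shuffle join could also be avoided by collecting it into a map and broadcasting it. This is a minimal sketch of that alternative, not part of the original answer; the "unknown" fallback is my own choice for manager IDs with no matching name:

val nameMap = sc.broadcast(namePairRDD.collectAsMap())

val result2 = namePairRDD
  .join(salaryPairRDD)
  .join(managerPairRDD)
  .map { case (id, ((name, salary), mngrId)) =>
    // look up the manager's name locally instead of joining again
    (id, name, salary, nameMap.value.getOrElse(mngrId, "unknown"))
  }

The broadcast keeps the small lookup table on every executor, so only the first two joins shuffle data.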