Logic for finding each employee's total salary in Spark Core

Date: 2017-02-20 10:23:22

Tags: apache-spark left-join rdd

employee.txt:

100|Surender
101|Raja

salary.txt:

100|2016-JAN|15000
100|2016-FEB|15000

Hi,

I am doing some basic hands-on practice with Spark Core using Scala.

The requirement is to compute the total salary of each employee. If an employee has no matching record in the salary file, that employee's salary should be shown as 0.

I tried the code below. I got as far as the join, but I don't know how to read the None and Some values in the result, so I cannot proceed further.

Could someone help me get to the expected output?

scala> val empRDD = sc.textFile("/user/cloudera/inputfiles/employee.txt")
scala> val salaryRDD = sc.textFile("/user/cloudera/inputfiles/salary.txt")
scala> val empMapRDD = empRDD.map( elem => elem.split("\\|"))
scala> val salaryMapRDD = salaryRDD.map(elem => elem.split("\\|"))
scala> val empKeyValueRDD = empMapRDD.map(elem => (elem(0),elem(1)))
scala> val salaryKeyValueRDD = salaryMapRDD.map(elem => (elem(0),elem(2)))
scala> val joinedRDD = empKeyValueRDD.leftOuterJoin(salaryKeyValueRDD)
scala> joinedRDD.collect
res3: Array[(String, (String, Option[String]))] = Array((101,(Raja,None)), (100,(Surender,Some(15000))), (100,(Surender,Some(15000))))
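
For reference, leftOuterJoin wraps the right-hand value in Option: Some(value) when a salary record matched, None when it did not. A minimal sketch of reading those values with getOrElse, using the joinedRDD from the transcript above (output order may vary):

 scala> val unwrapped = joinedRDD.mapValues { case (name, salOpt) => (name, salOpt.getOrElse("0")) }
 scala> unwrapped.collect
 // e.g. Array((101,(Raja,0)), (100,(Surender,15000)), (100,(Surender,15000)))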

Expected output:

Array((100,Surender,30000), (101,Raja,0))

2 answers:

Answer 0 (score: 1)


 val joinedRDD = empKeyValueRDD.leftOuterJoin(salaryKeyValueRDD)
   .groupBy(x => (x._1, x._2._1))
   .map(r => {
     val sal = r._2.map(x => x._2._2 match {
       case None => 0
       case Some(num) => num.toLong
     }).sum
     (r._1._1, r._1._2, sal)
   })

 println(joinedRDD.collect.toList)
 // List((100,Surender,30000), (101,Raja,0))

After the groupBy(x => (x._1, x._2._1)) step, the intermediate data will look like this:
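
Derived from the collect output in the question (the exact container type and ordering may vary by Spark version), the grouped records would be roughly:

 ((100,Surender), CompactBuffer((100,(Surender,Some(15000))), (100,(Surender,Some(15000)))))
 ((101,Raja),     CompactBuffer((101,(Raja,None))))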

Answer 1 (score: 0)

I tried the code style below and got the result:

 ...
 scala> joinedRDD.map(elem => ((elem._1, elem._2._1), elem._2._2 match { case Some(i) => i.toInt case None => 0 }))
                 .reduceByKey((x, y) => x + y)
                 .map(elem => (elem._1._1, elem._1._2, elem._2))
                 .collect

Output:

 Array[(String, String, Int)] = Array((100,Surender,30000), (101,Raja,0))

Please let me know if there is any other way to achieve the same result.
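
One alternative, sketched here as a suggestion rather than a tested solution (it assumes the same empKeyValueRDD and salaryKeyValueRDD as in the question): collapse the Option with getOrElse before summing, which avoids the pattern match entirely.

 val totals = empKeyValueRDD
   .leftOuterJoin(salaryKeyValueRDD)
   .map { case (id, (name, salOpt)) =>
     // getOrElse(0) turns a missing salary (None) into 0 directly
     ((id, name), salOpt.map(_.toInt).getOrElse(0))
   }
   .reduceByKey(_ + _)  // sum the salaries per (id, name)
   .map { case ((id, name), total) => (id, name, total) }

 totals.collect // Array((100,Surender,30000), (101,Raja,0))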