employee.txt:
100|Surender
101|Raja
salary.txt:
100|2016-JAN|15000
100|2016-FEB|15000
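Hello,
I am doing some basic practice in Spark core using Scala.
The requirement is to calculate the total salary of each employee. If an employee has no matching record in the salary file, their salary should be shown as 0.
I tried the code below. I got as far as the join, but I don't know how to read the None and Some values, so I could not proceed.
Can someone help me get the expected output?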
scala> val empRDD = sc.textFile("/user/cloudera/inputfiles/employee.txt")
scala> val salaryRDD = sc.textFile("/user/cloudera/inputfiles/salary.txt")
scala> val empMapRDD = empRDD.map( elem => elem.split("\\|"))
scala> val salaryMapRDD = salaryRDD.map(elem => elem.split("\\|"))
scala> val empKeyValueRDD = empMapRDD.map(elem => (elem(0),elem(1)))
scala> val salaryKeyValueRDD = salaryMapRDD.map(elem => (elem(0),elem(2)))
scala> val joinedRDD = empKeyValueRDD.leftOuterJoin(salaryKeyValueRDD)
scala> joinedRDD.collect
res3: Array[(String, (String, Option[String]))] = Array((101,(Raja,None)), (100,(Surender,Some(15000))), (100,(Surender,Some(15000))))
Expected output:
Array((100,Surender,30000), (101,Raja,0))
Answer 0 (score: 1)
val joinedRDD = empKeyValueRDD.leftOuterJoin(salaryKeyValueRDD)
.groupBy(x => (x._1, x._2._1))
.map(r => {
val sal = r._2.map(x => x._2._2 match {
case None => 0
case Some(num) => num.toLong
}).sum
(r._1._1, r._1._2, sal)
})
println(joinedRDD.collect.toList)
//List((100,Surender,30000), (101,Raja,0))
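After groupBy(x => (x._1, x._2._1)), the intermediate data is keyed by (id, name), and each key holds all of the joined records for that employee.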
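To make the shape of the data after the groupBy step concrete, here is a minimal sketch that uses plain Scala collections in place of the RDD, so it can be tried without a cluster. The sample values are copied from the joinedRDD.collect output in the question. The names joined, grouped and totals are only illustrative, and a Seq's groupBy returns a Map rather than an RDD of (key, Iterable) pairs, so treat it as an analogy rather than the exact Spark types.

```scala
// Plain Scala collections standing in for the joined RDD (no Spark needed).
// The values mirror the joinedRDD.collect output shown in the question.
val joined: Seq[(String, (String, Option[String]))] = Seq(
  ("101", ("Raja", None)),
  ("100", ("Surender", Some("15000"))),
  ("100", ("Surender", Some("15000")))
)

// Group by (id, name): each key maps to every joined record that shares it.
val grouped: Map[(String, String), Seq[(String, (String, Option[String]))]] =
  joined.groupBy(x => (x._1, x._2._1))

// grouped(("100", "Surender")) holds the two Some("15000") records;
// grouped(("101", "Raja")) holds the single None record.

// Sum the salaries in each group, treating None as 0.
val totals: Seq[(String, String, Long)] = grouped.toSeq.map {
  case ((id, name), records) =>
    val sal = records.map(r => r._2._2 match {
      case None      => 0L
      case Some(num) => num.toLong
    }).sum
    (id, name, sal)
}.sortBy(_._1)

println(totals)
```

The pattern match is what "reads" the Option: None means the left outer join found no salary row for that employee, and Some(num) wraps the salary string that was found.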
Answer 1 (score: 0)
I tried the code below and got the result.
...
scala> joinedRDD.map( elem => ((elem._1, elem._2._1),elem._2._2 match { case Some(i) => i.toInt case None => 0 } ) ).reduceByKey((x,y) => x+y).map(elem => (elem._1._1,elem._1._2,elem._2)).collect
Output:
Array[(String, String, Int)] = Array((100,Surender,30000), (101,Raja,0))
Please let me know if there are other ways to achieve the same result.
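One other possible approach, sketched here under some assumptions: sum the salaries per employee id first with reduceByKey, then do the leftOuterJoin, and unwrap the Option with getOrElse(0) instead of a pattern match. To keep the sketch self-contained, the sample rows are inlined with sc.parallelize instead of being read from the files, and SparkContext.getOrCreate is used so the same code reuses the shell's context or starts a local one. The names totalSalaryRDD and result are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the existing context in spark-shell; otherwise starts a local one
// (the local master and app name are assumptions for a standalone run).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("salary-totals").setMaster("local[*]"))

// Sample rows from employee.txt and salary.txt, already split into key/value pairs.
val empKeyValueRDD = sc.parallelize(Seq(("100", "Surender"), ("101", "Raja")))
val salaryKeyValueRDD = sc.parallelize(Seq(("100", "15000"), ("100", "15000")))

// Sum salaries per employee id before joining, so the join sees one row per id.
val totalSalaryRDD = salaryKeyValueRDD.mapValues(_.toInt).reduceByKey(_ + _)

// leftOuterJoin keeps every employee; a missing total arrives as None,
// which getOrElse(0) turns into 0.
val result = empKeyValueRDD
  .leftOuterJoin(totalSalaryRDD)
  .map { case (id, (name, total)) => (id, name, total.getOrElse(0)) }
  .sortBy(_._1)
  .collect()

println(result.toList)
```

Aggregating before the join means each employee is joined against a single total rather than against every monthly row, which may reduce the amount of data shuffled when the salary file is large.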