How to sum a specified column in an RDD

Asked: 2018-03-23 18:37:12

Tags: rdd

Here are two data files:

spark16/file1.txt
1,9,5
2,7,4
3,8,3

spark16/file2.txt
1,g,h
2,i,j
3,k,l

After the join, I have:

(1, ((9,5),(g,h)) )
(2, ((7,4),(i,j)) )
(3, ((8,3),(k,l)) )

I need to get the sum of the Int column from file1, i.e.

5 + 4 + 3 = 12

This is where I'm stuck:

val file1 = sc.textFile("spark16/file1.txt").map(x => (x.split(",")(0).toInt, (x.split(",")(1), x.split(",")(2).toInt)))
val file2 = sc.textFile("spark16/file2.txt").map(x => (x.split(",")(0).toInt, (x.split(",")(1), x.split(",")(2))))

val joined = file1.join(file2)
val sorted = joined.sortByKey()

val first = sorted.first
first: (Int, ((String, Int), (String, String))) = (1,((9,5),(g,h)))

scala> joined.reduce(_._2._1._2 + _._2._1._2)
<console>:34: error: type mismatch;
 found   : Int
 required: (Int, ((String, Int), (String, String)))
       joined.reduce(_._2._1._2 + _._2._1._2)
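
The error comes from reduce's signature: on an RDD[T] it takes a function (T, T) => T, so the combining function must return the element type itself. Here T is the full (Int, ((String, Int), (String, String))) tuple, so returning a bare Int cannot type-check. A minimal illustration on a standalone RDD (hypothetical data, not from the files above):

// reduce: def reduce(f: (T, T) => T): T -- the result type must equal
// the element type of the RDD
val ok = sc.parallelize(Seq(1, 2, 3)).reduce(_ + _)   // Int + Int => Int, sums to 6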

How do I get the sum over _._2._1._2?

Many thanks.

1 answer:

Answer 0 (score: 0)

If this is what you have after the join:

(1, ((9,5),(g,h)) )
(2, ((7,4),(i,j)) )
(3, ((8,3),(k,l)) )

then select the column you need and do a reduce:

joined.map(_._2._1._2).reduce(_ + _) 

This should give 12, the sum of 5, 4, 3.
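
Putting it together, a minimal end-to-end sketch (assuming the spark16 paths from the question and an existing SparkContext sc):

// parse file1 as (key, (string col, int col)) and file2 as (key, (string, string))
val file1 = sc.textFile("spark16/file1.txt").map { line =>
  val f = line.split(",")
  (f(0).toInt, (f(1), f(2).toInt))
}
val file2 = sc.textFile("spark16/file2.txt").map { line =>
  val f = line.split(",")
  (f(0).toInt, (f(1), f(2)))
}

// join on the key, keep only the Int column, and sum it
val total = file1.join(file2)   // RDD[(Int, ((String, Int), (String, String)))]
  .map(_._2._1._2)              // RDD[Int] containing 5, 4, 3
  .reduce(_ + _)                // 12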

reduce must return the same data type as the elements passed to it.
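
If you want to avoid the intermediate map, aggregate (unlike reduce) lets the result type differ from the element type; a sketch under the same assumptions:

// zero value 0; seqOp adds each element's Int column into the accumulator,
// combOp merges the per-partition partial sums
val total = joined.aggregate(0)(
  (acc, kv) => acc + kv._2._1._2,
  _ + _
)   // 12

Spark's numeric RDDs also offer sum(), so joined.map(_._2._1._2).sum() would give the same total as a Double.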

Hope this helps!