How to get the sum of two elements from a Spark RDD Iterable

Asked: 2017-12-05 04:58:16

Tags: scala apache-spark

I am trying to write a simple program that finds, for each driver's records, the sum of the hours logged and the sum of the miles logged. I have applied groupByKey, and the RDD now looks like this:

(13,CompactBuffer((49,2643), (56,2553), (60,2539), (55,2553), (45,2762), (53,2699), (46,2519), (60,2719), (56,2760), (51,2731), (57,2671), (47,2604), (58,2510), (51,2649), (56,2559), (59,2604), (47,2613), (49,2585), (58,2749), (50,2756), (57,2596), (54,2517), (48,2554), (47,2576), (58,2528), (60,2765), (54,2689), (51,2739), (51,2698), (47,2739), (51,2546), (54,2647), (60,2504), (48,2536), (51,2602), (47,2651), (53,2545), (48,2665), (55,2670), (60,2524), (48,2612), (60,2712), (60,2583), (47,2773), (57,2589), (51,2512), (57,2607), (57,2576), (53,2604), (59,2702), (51,2687), (10,100)))
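
For reference, the grouping step I applied looks roughly like the sketch below; the input file name, comma delimiter, and (hours, miles) field order here are just assumptions for illustration:

// Hypothetical input: one "driverId,hours,miles" record per line
val records = sc.textFile("driver_records.csv")
  .map(_.split(","))
  .map(f => (f(0).toInt, (f(1).toInt, f(2).toInt)))  // (driverId, (hours, miles))

val grouped = records.groupByKey()  // RDD[(Int, Iterable[(Int, Int)])]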

Could you suggest some useful Scala functions to get the sums of the two elements? Thanks!

1 Answer:

Answer 0 (score: 1)

If I understand your question correctly, you can use groupByKey, mapValues and reduce to sum the hours and the miles. Here is one approach:

// sample (driverId, (hours, miles)) pairs
val rdd = sc.parallelize(Seq(
  (13, (49,2643)),
  (13, (56,2553)),
  (13, (60,2539)),
  (14, (40,1500)),
  (14, (50,2500))
))

rdd.groupByKey.mapValues( _.reduce( (a, x) => (a._1 + x._1, a._2 + x._2) ) ).collect
// res1: Array[(Int, (Int, Int))] = Array((13,(165,7735)), (14,(90,4000)))

Or, as pointed out in the comments, if you don't need the intermediate grouped result from groupByKey, use reduceByKey to do the aggregation directly:

rdd.reduceByKey( (a, x) => (a._1 + x._1, a._2 + x._2) ).collect
// res2: Array[(Int, (Int, Int))] = Array((13,(165,7735)), (14,(90,4000)))
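
For completeness, the same aggregation can also be written with aggregateByKey, which, like reduceByKey, combines values on the map side before shuffling (this variant is an addition for illustration, not part of the original answer):

rdd.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v._1, acc._2 + v._2),  // fold one (hours, miles) pair into the accumulator
  (a, b)   => (a._1 + b._1, a._2 + b._2)       // merge accumulators from different partitions
).collect
// Array((13,(165,7735)), (14,(90,4000)))

In general, prefer reduceByKey or aggregateByKey over groupByKey when you only need per-key aggregates, since groupByKey materializes every value for a key in memory.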