I have an RDD[(String, Array[(String, Int)])]:
["abc",[("asd",1),("asd",3),("cvd",2),("cvd",2),("xyz",1)]]
and I want to turn it into:
["abc",[("asd",4),("cvd",4),("xyz",1)]]
I have tried:
val y = hashedRdd.map(f => (f._1, f._2.map(_._2).reduce((a, b) => a + b)))
But this returns an RDD[(String, Int)], while I want an RDD[(String, Array[(String, Int)])].
Answer 0 (score: 1)
You can group the Array by its first element and sum the values:
// Raw rdd
val hashedRdd = spark.sparkContext.parallelize(Seq(
("abc",Array(("asd",1),("asd",3),("cvd",2),("cvd",2),("xyz",1)))
))
// Group by the first element and sum the values
val y = hashedRdd.map(x => {
(x._1, x._2.groupBy(_._1).mapValues(_.map(_._2).sum))
})
Output:
y.foreach(println)
(abc,Map(xyz -> 1, asd -> 4, cvd -> 4))
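Note that groupBy returns a Map, so y is actually an RDD[(String, Map[String, Int])]. If you need the exact Array[(String, Int)] shape asked for in the question, a minimal sketch (the name z is mine) that appends .toArray to the grouped values:
val z = hashedRdd.map(x => {
  // .toArray turns the inner Map[String, Int] back into Array[(String, Int)],
  // giving an RDD[(String, Array[(String, Int)])] as requested
  (x._1, x._2.groupBy(_._1).mapValues(_.map(_._2).sum).toArray)
})
z.foreach { case (k, v) => println((k, v.mkString("Array(", ", ", ")"))) }
// e.g. (abc,Array((xyz,1), (asd,4), (cvd,4))); element order follows the Map and is not guaranteed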
Hope this helps!
Answer 1 (score: 0)
One approach is to groupBy the tuples on the first element and then reduce:
@ hashedRdd.map { f => (f._1, f._2.groupBy(_._1).map { _._2.reduce { (a, b) => (a._1, a._2 + b._2) } }) }.collect
res11: Array[(String, Map[String, Int])] = Array(("abc", Map("xyz" -> 1, "asd" -> 4, "cvd" -> 4)))
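Both answers do the aggregation inside a single map closure, which is fine when each inner Array is small. If the inner collections were large, here is a sketch of an alternative (variable names such as pairRdd are mine, not from the answers) that lets Spark distribute the summing via flatMap and reduceByKey:
// Flatten to ((outerKey, innerKey), count), sum per composite key,
// then regroup into (outerKey, Array[(innerKey, total)])
val pairRdd = hashedRdd
  .flatMap { case (k, arr) => arr.map { case (ik, v) => ((k, ik), v) } }
  .reduceByKey(_ + _)
  .map { case ((k, ik), total) => (k, (ik, total)) }
  .groupByKey()
  .mapValues(_.toArray)   // RDD[(String, Array[(String, Int)])]
This trades the single pass for a shuffle, so it only pays off when the per-key arrays are too big to aggregate comfortably in one closure.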