Check whether an RDD contains duplicate keys and, if so, merge their values

Time: 2018-08-16 23:13:07

Tags: scala apache-spark

I have an RDD[(String, Map[String, Int])]:

    [("A",Map("acs"->2,"sdv"->2,"sfd"->1)),("B",Map("ass"->2,"fvv"->2,"ffd"->1)),("A",Map("acs"->2,"sdv"->2,"sfd"->1))]

I want to merge the elements that share the same key, summing the values of matching map entries:

    [("A",Map("acs"->4,"sdv"->4,"sfd"->2)),("B",Map("ass"->2,"fvv"->2,"ffd"->1))]

How can I do this in Scala?

2 Answers:

Answer 0: (score: 3)

If you define mapSum (see "merge two maps and sum values"):

def mapSum[T](map1: Map[T, Int], map2: Map[T, Int]): Map[T, Int] = map1 ++ map2.map{ case (k,v) => k -> (v + map1.getOrElse(k,0)) }

Then you can group by key and reduce each group (similar to your other question):

@ rdd.groupBy(_._1).map(_._2.reduce((a, b) => (a._1, mapSum(a._2, b._2)))).collect
res11: Array[(String, Map[String, Int])] = Array(
  ("A", Map("acs" -> 4, "sdv" -> 4, "sfd" -> 2)),
  ("B", Map("ass" -> 2, "fvv" -> 2, "ffd" -> 1))
)
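
The same group-and-reduce logic can be checked without a Spark cluster. The following is a minimal sketch on a plain Scala `Seq`, assuming the data from the question; `records` and `mergedLocal` are hypothetical names introduced only for this illustration, and `mapSum` is repeated so the snippet runs on its own.

```scala
// Same mapSum as in the answer, repeated so this snippet is self-contained.
def mapSum[T](map1: Map[T, Int], map2: Map[T, Int]): Map[T, Int] =
  map1 ++ map2.map { case (k, v) => k -> (v + map1.getOrElse(k, 0)) }

// Hypothetical local stand-in for the RDD contents from the question.
val records: Seq[(String, Map[String, Int])] = Seq(
  ("A", Map("acs" -> 2, "sdv" -> 2, "sfd" -> 1)),
  ("B", Map("ass" -> 2, "fvv" -> 2, "ffd" -> 1)),
  ("A", Map("acs" -> 2, "sdv" -> 2, "sfd" -> 1))
)

// Group the pairs by key, then combine each group's maps with mapSum.
val mergedLocal: Map[String, Map[String, Int]] =
  records.groupBy(_._1).map { case (key, group) =>
    key -> group.map(_._2).reduce((a, b) => mapSum(a, b))
  }
```

Keys that appear in only one of the two maps are kept unchanged, because `getOrElse(k, 0)` contributes zero for a key missing from `map1` and `++` preserves the entries of `map1` that `map2` does not override.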

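Answer 1: (score: 2)

An efficient approach is to use reduceByKey, which aggregates the Maps for each key by summing the values of matching entries into an accumulator: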

val rdd = sc.parallelize(Seq(
  ("A", Map("acs"->2, "sdv"->2, "sfd"->1)),
  ("B", Map("ass"->2, "fvv"->2, "ffd"->1)),
  ("A", Map("acs"->2, "sdv"->2, "sfd"->1))
))

rdd.reduceByKey( (acc, m) =>
  acc ++ m.map{ case (k, v) => (k, acc.getOrElse(k, 0) + v) }
).collect

// res1: Array[(String, scala.collection.immutable.Map[String,Int])] = Array(
//   (A,Map(acs -> 4, sdv -> 4, sfd -> 2)),
//   (B,Map(ass -> 2, fvv -> 2, ffd -> 1))
// )
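
reduceByKey can combine values within each partition before shuffling, whereas groupBy moves every value for a key across the network first, which is why this version tends to scale better. It does require the merge function to be associative and commutative. The sketch below emulates the per-key reduction locally with `foldLeft`, assuming the same data; `mergeCounts`, `pairs`, and `reducedLocal` are hypothetical names used only here, and this is an illustration of the idea rather than how Spark implements it.

```scala
// The merge function from the answer, given a name for reuse (hypothetical name).
def mergeCounts(acc: Map[String, Int], m: Map[String, Int]): Map[String, Int] =
  acc ++ m.map { case (k, v) => (k, acc.getOrElse(k, 0) + v) }

// Hypothetical local stand-in for the RDD contents.
val pairs: Seq[(String, Map[String, Int])] = Seq(
  ("A", Map("acs" -> 2, "sdv" -> 2, "sfd" -> 1)),
  ("B", Map("ass" -> 2, "fvv" -> 2, "ffd" -> 1)),
  ("A", Map("acs" -> 2, "sdv" -> 2, "sfd" -> 1))
)

// Local emulation of a per-key reduce: fold each pair into a running result,
// merging with the existing map for that key if one has already been seen.
val reducedLocal: Map[String, Map[String, Int]] =
  pairs.foldLeft(Map.empty[String, Map[String, Int]]) { case (result, (key, m)) =>
    val merged = result.get(key).map(mergeCounts(_, m)).getOrElse(m)
    result.updated(key, merged)
  }
```

Because `mergeCounts` only adds integers per entry, swapping its two arguments yields the same map, which is the property reduceByKey relies on when it merges partial results in an arbitrary order.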