I have a single RDD of the form:
nameResolvedFromHashes: RDD[(Node, String, Option[String], Option[String], Map[String, String])]
Sample data from my RDD looks like this:
(<MyXml1>,{MyJson1},Some(1),Some(2),Map(hash1 -> value1))
(<MyXml1>,{MyJson1},Some(1),Some(2),Map(hash2 -> value2))
(<MyXml2>,{MyJson2},Some(3),Some(4),Map(hash3 -> value3))
I want to end up with the following: whenever the first four elements are the same, merge the maps in the _5 element of the tuples into one.
Output:
(<MyXml1>,{MyJson1},Some(1),Some(2),Map(hash1 -> value1,hash2 -> value2))
(<MyXml2>,{MyJson2},Some(3),Some(4),Map(hash3 -> value3))
I tried:
nameResolvedFromHashes.map(tup => ((tup._1,tup._2,tup._3,tup._4), tup._5)).reduceByKey { case (a, _) => a }.map(_._2)
But it only gives me the 2nd and 3rd lines of my input as output. Please help.
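A note on why this happens: the reduce function { case (a, _) => a } keeps one of the two maps for each key and discards the other, so nothing is ever merged. A minimal sketch of that behavior in plain Scala (no Spark needed; keepFirst is a hypothetical name for illustration):

// The reduce function from the attempt above: returns the left map, drops the right.
val keepFirst: (Map[String, String], Map[String, String]) => Map[String, String] =
  { case (a, _) => a }

keepFirst(Map("hash1" -> "value1"), Map("hash2" -> "value2"))
// => Map(hash1 -> value1)   -- hash2 -> value2 is discarded, not merged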
Answer (score: 1)
I don't understand the logic of your reduceByKey -> map step. Merging the maps with a single reduceByKey seems to achieve your goal. Is there something I'm missing?
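The key change is the reduce function: instead of keeping one map and discarding the other, concatenate them with ++ so the entries of both survive. In plain Scala:

Map("hash1" -> "value1") ++ Map("hash2" -> "value2")
// => Map(hash1 -> value1, hash2 -> value2)

Here is a full spark-shell session demonstrating the approach on a simplified element type (String and Int fields standing in for your Node and Options):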
scala> val in = Seq(("a", "b", 1, 2, Map((1 -> "c"))),("a", "b", 1, 2, Map((2 -> "d"))),("e", "f", 1, 2, Map((1 -> "g"))))
in: Seq[(String, String, Int, Int, scala.collection.immutable.Map[Int,String])] = List((a,b,1,2,Map(1 -> c)), (a,b,1,2,Map(2 -> d)), (e,f,1,2,Map(1 -> g)))
scala> val rdd = spark.sparkContext.parallelize(in)
rdd: org.apache.spark.rdd.RDD[(String, String, Int, Int, scala.collection.immutable.Map[Int,String])] = ParallelCollectionRDD[14] at parallelize at <console>:25
scala> val done = rdd.map(tup => ((tup._1, tup._2, tup._3, tup._4), tup._5)).reduceByKey(_ ++ _).map { case ((a, b, c, d), e) => (a, b, c, d, e) }
done: org.apache.spark.rdd.RDD[(String, String, Int, Int, scala.collection.immutable.Map[Int,String])] = MapPartitionsRDD[16] at map at <console>:25
scala> done foreach println
(a,b,1,2,Map(1 -> c, 2 -> d))
(e,f,1,2,Map(1 -> g))
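One caveat: ++ on immutable maps is right-biased, so if the same hash key ever appears in two rows sharing the same first four fields, the value from whichever map is folded in last wins. Applied back to the element type from the question, the same pattern would look roughly like this (a sketch, assuming nameResolvedFromHashes is the RDD from the question and that scala.xml.Node's structural equality is acceptable for use in a shuffle key):

import scala.xml.Node
import org.apache.spark.rdd.RDD

val merged: RDD[(Node, String, Option[String], Option[String], Map[String, String])] =
  nameResolvedFromHashes
    .map { case (xml, json, o1, o2, hashes) => ((xml, json, o1, o2), hashes) } // key on the first four fields
    .reduceByKey(_ ++ _)                                                       // union the hash maps per key
    .map { case ((xml, json, o1, o2), hashes) => (xml, json, o1, o2, hashes) } // flatten back to a 5-tuple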