Join and map values on RDDs - constructor cannot be instantiated to expected type

Time: 2018-07-25 13:29:24

Tags: scala apache-spark rdd

I am new to the Scala and Spark world, so I need a little help.

I have two RDDs, both of type RDD[((String, String), Double)], with contents like:

RDD1 = 
((a, b), 10)
((c, d), 20)
((g, h), 50)

RDD2 = 
((a, b), 20)
((e, f), 30)
((g, h), 10)

and the desired output is:

(a, b, 30)
(c, d, 20)
(e, f, 30)
(g, h, 60)

Apologies for posting mock data due to some policies, but any help is much appreciated.

Here is what I tried:

    val joined = rdd1.fullOuterJoin(rdd2).map{case(x, y, z) => (x._1, x._2, y+z)}

but it seems I'm making a mistake somewhere. It fails with this error:
[error] ...../class.scala:59: constructor cannot be instantiated to expected type;
[error]  found   : (T1, T2, T3)
[error]  required: ((String, String), (Option[Double], Option[Double]))
[error]       val joined = rdd1.fullOuterJoin(rdd2).map{case(x, y, z) => (x._1, x._2, y._1+z._1)}
[error]                                                                    ^
[error] ...../class.scala:59: not found: value x
[error]       val joined = rdd1.fullOuterJoin(rdd2).map{case(x, y, z) => (x._1, x._2, y._1+z._1)}
[error]                                                                                  ^
[error] ...../class.scala:59: not found: value x
[error]       val joined = rdd1.fullOuterJoin(rdd2).map{case(x, y, z) => (x._1, x._2, y._1+z._1)}
[error]                                                                                        ^
[error] ...../class.scala:59: not found: value y
[error]       val joined = rdd1.fullOuterJoin(rdd2).map{case(x, y, z) => (x._1, x._2, y._1+z._1)}
[error]                                                                                              ^
[error] four errors found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 3 s, completed 25 Jul, 2018 6:54:09 PM

Any help would be greatly appreciated.

2 Answers:

Answer 0 (score: 0)

Try the following; since this is a full outer join, the Double values come back wrapped in Options:

val rdd1 = sc.parallelize(Seq((("a","b"),10.0),
                          (("c","d"),20.0),
                          (("g","h"),50.0)))

val rdd2 = sc.parallelize(Seq((("a","b"),20.0),
                          (("e","f"),30.0),
                          (("g","h"),10.0)))

rdd1.fullOuterJoin(rdd2).map {case ((x1, x2), (y1, y2))  => (x1,x2,y1.getOrElse(0.0) + y2.getOrElse(0.0))}.collect.foreach(println)

//(g,h,60.0)
//(c,d,20.0)
//(e,f,30.0)
//(a,b,30.0)
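To see why the original case (x, y, z) pattern fails to compile: fullOuterJoin produces a key paired with a pair of Options, not a flat 3-tuple, so the match must mirror that nesting. The shape can be reproduced without Spark (a minimal sketch using plain Scala values):

```scala
// Element type produced by fullOuterJoin on RDD[((String, String), Double)]:
type Joined = ((String, String), (Option[Double], Option[Double]))

// One joined element, e.g. for a key present in both RDDs:
val elem: Joined = (("a", "b"), (Some(10.0), Some(20.0)))

// `case (x, y, z)` expects a flat 3-tuple and cannot match this nested pair;
// the pattern has to mirror the nesting instead:
val sum = elem match {
  case ((k1, k2), (left, right)) =>
    left.getOrElse(0.0) + right.getOrElse(0.0)
}
```

A key missing from one side arrives as None on that side, which is why getOrElse(0.0) is needed before summing.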

Answer 1 (score: 0)

Here is a simple one-liner solution that avoids joining the RDDs:

val rdd1 = sc.parallelize(Seq((("a","b"),10.0),
                      (("c","d"),20.0),
                      (("g","h"),50.0)))
val rdd2 = sc.parallelize(Seq((("a","b"),20.0),
                      (("e","f"),30.0),
                      (("g","h"),10.0)))

We take the union of the two RDDs and use reduceByKey to sum the values for each key:

val result = (rdd1 union rdd2).reduceByKey(_ + _)

This is equivalent to the following line:

val result = (rdd1 union rdd2).reduceByKey((x,y) => x+y)

Let's check the output of the final result:

result.foreach(println)

//((e,f),30.0)
//((a,b),30.0)
//((g,h),60.0)
//((c,d),20.0)
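Note that result still has the nested key shape ((a,b),30.0); to get the flat (a, b, 30) tuples the question asks for, one more map that destructures the key is needed. Below is a sketch that simulates union and reduceByKey with plain Scala collections (groupBy plus sum stands in for reduceByKey, so it runs without a SparkContext):

```scala
// Plain-collection stand-ins for the two RDDs.
val rdd1 = Seq((("a", "b"), 10.0), (("c", "d"), 20.0), (("g", "h"), 50.0))
val rdd2 = Seq((("a", "b"), 20.0), (("e", "f"), 30.0), (("g", "h"), 10.0))

// union + reduceByKey(_ + _), simulated: concatenate, group by key, sum.
val summed = (rdd1 ++ rdd2)
  .groupBy { case (key, _) => key }
  .map { case (key, pairs) => (key, pairs.map(_._2).sum) }

// Flatten the nested key into the desired (k1, k2, total) triples.
val flat = summed.map { case ((k1, k2), total) => (k1, k2, total) }.toSet
```

On the real RDDs the flattening step is the same final map: result.map { case ((k1, k2), v) => (k1, k2, v) }.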

Hope this helps!