I'm new to the Scala and Spark world, so I need a little help.
I have two values, both of type RDD[((String, String), Double)],
with contents like this:
RDD1 =
((a, b), 10)
((c, d), 20)
((g, h),50)
RDD2 =
((a, b), 20)
((e, f), 30)
((g, h), 10)
The desired output is:
(a, b, 30)
(c, d, 20)
(e, f, 30)
(g, h, 60)
Sorry for posting mock data (this is due to some policies), but I would really appreciate your help.
I tried:
val joined = rdd1.fullOuterJoin(rdd2).map{case(x, y, z) => (x._1, x._2, y+z)}
but I seem to be making a mistake somewhere. It fails with this error:
[error] ...../class.scala:59: constructor cannot be instantiated to expected type;
[error] found : (T1, T2, T3)
[error] required: ((String, String), (Option[Double], Option[Double]))
[error] val joined = rdd1.fullOuterJoin(rdd2).map{case(x, y, z) => (x._1, x._2, y._1+z._1)}
[error] ^
[error] ...../class.scala:59: not found: value x
[error] val joined = rdd1.fullOuterJoin(rdd2).map{case(x, y, z) => (x._1, x._2, y._1+z._1)}
[error] ^
[error] ...../class.scala:59: not found: value x
[error] val joined = rdd1.fullOuterJoin(rdd2).map{case(x, y, z) => (x._1, x._2, y._1+z._1)}
[error] ^
[error] ...../class.scala:59: not found: value y
[error] val joined = rdd1.fullOuterJoin(rdd2).map{case(x, y, z) => (x._1, x._2, y._1+z._1)}
[error] ^
[error] four errors found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 3 s, completed 25 Jul, 2018 6:54:09 PM
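Any help would be greatly appreciated.
Answer 0 (score: 0)
Try the following. After fullOuterJoin each Double value is wrapped in an Option, so the pattern has to match the nested pairs and unwrap both sides: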
val rdd1 = sc.parallelize(Seq((("a","b"),10.0),
(("c","d"),20.0),
(("g","h"),50.0)))
val rdd2 = sc.parallelize(Seq((("a","b"),20.0),
(("e","f"),30.0),
(("g","h"),10.0)))
rdd1.fullOuterJoin(rdd2)
  .map { case ((x1, x2), (y1, y2)) => (x1, x2, y1.getOrElse(0.0) + y2.getOrElse(0.0)) }
  .collect
  .foreach(println)
// Prints (row order may vary):
// (g,h,60.0)
// (c,d,20.0)
// (e,f,30.0)
// (a,b,30.0)
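To see why the original `case (x, y, z)` pattern does not compile, it may help to look at the element type named in the error message: ((String, String), (Option[Double], Option[Double])). That is a pair whose second component is itself a pair of Options, not a 3-tuple. The following is only a sketch using plain Scala collections, with no Spark involved; `JoinShapeSketch` and the hand-written `joinedRows` are illustrative names and data that mimic what the join would produce for the sample input.

```scala
object JoinShapeSketch {
  // Hand-written rows mimicking the element type of rdd1.fullOuterJoin(rdd2):
  // ((String, String), (Option[Double], Option[Double]))
  val joinedRows: Seq[((String, String), (Option[Double], Option[Double]))] = Seq(
    (("a", "b"), (Some(10.0), Some(20.0))), // key present on both sides
    (("c", "d"), (Some(20.0), None)),       // key only on the left side
    (("e", "f"), (None, Some(30.0))),       // key only on the right side
    (("g", "h"), (Some(50.0), Some(10.0)))
  )

  // Nested pattern: an outer pair made of a key pair and a value pair.
  // A missing side (None) contributes 0.0 to the sum.
  val summed: Seq[(String, String, Double)] = joinedRows.map {
    case ((k1, k2), (left, right)) =>
      (k1, k2, left.getOrElse(0.0) + right.getOrElse(0.0))
  }

  def main(args: Array[String]): Unit = summed.foreach(println)
}
```

The same nested pattern is what the `map` call on the joined RDD above relies on.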
Answer 1 (score: 0)
Here is a simple one-line solution that does not require joining the RDDs:
val rdd1 = sc.parallelize(Seq((("a","b"),10.0),
(("c","d"),20.0),
(("g","h"),50.0)))
val rdd2 = sc.parallelize(Seq((("a","b"),20.0),
(("e","f"),30.0),
(("g","h"),10.0)))
Then we merge the values of the two RDDs and use reduceByKey to sum the values for each key:
val result = (rdd1 union rdd2).reduceByKey(_ + _)
This is equivalent to the following line:
val result = (rdd1 union rdd2).reduceByKey((x,y) => x+y)
Let's check the output of the final result:
result.collect.foreach(println) // collect first so the rows are printed on the driver
//((e,f),30.0)
//((a,b),30.0)
//((g,h),60.0)
//((c,d),20.0)
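Note that these rows still have the nested shape ((key1, key2), sum), while the question asked for flat triples such as (a, b, 30). On the RDD itself, one more step like `result.map { case ((k1, k2), v) => (k1, k2, v) }` should flatten them. The sketch below models the whole idea with plain Scala collections rather than Spark: `++` stands in for union, and groupBy followed by reduce stands in for reduceByKey. All names in it (`UnionReduceSketch`, `rows1`, `rows2`, `flattened`) are made up for illustration.

```scala
object UnionReduceSketch {
  type Row = ((String, String), Double)

  val rows1: Seq[Row] = Seq((("a", "b"), 10.0), (("c", "d"), 20.0), (("g", "h"), 50.0))
  val rows2: Seq[Row] = Seq((("a", "b"), 20.0), (("e", "f"), 30.0), (("g", "h"), 10.0))

  // Concatenate both sequences, group rows by key, then sum the values of each key.
  val summedByKey: Map[(String, String), Double] =
    (rows1 ++ rows2)
      .groupBy(_._1)
      .map { case (key, rows) => key -> rows.map(_._2).reduce(_ + _) }

  // Flatten ((k1, k2), v) into (k1, k2, v) and sort by key for a stable order.
  val flattened: Seq[(String, String, Double)] =
    summedByKey.toSeq
      .map { case ((k1, k2), v) => (k1, k2, v) }
      .sortBy(t => (t._1, t._2))

  def main(args: Array[String]): Unit = flattened.foreach(println)
}
```

The explicit sort is only there to make the printed order deterministic; an RDD gives no ordering guarantee after reduceByKey.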
Hope this helps!