I am trying to join 2 PairRDDs in Spark and am not sure how to iterate over the result.
val input1 = sc.textFile(inputFile1)
val input2 = sc.textFile(inputFile2)
// key each record of input1 on its 19th pipe-delimited field (index 18)
val pairs = input1.map(x => (x.split("\\|")(18), x))
val groupPairs = pairs.groupByKey()
// key each record of input2 on its first pipe-delimited field
val staPairs = input2.map(y => (y.split("\\|")(0), y))
val stagroupPairs = staPairs.groupByKey()
val finalJoined = groupPairs.leftOuterJoin(stagroupPairs)
finalJoined: org.apache.spark.rdd.RDD[(String, (Iterable[String], Option[Iterable[String]]))]
When I run finalJoined.collect().foreach(println), I see the following output:
(key1,(CompactBuffer(val1a,val1b),Some(CompactBuffer(val1))))
(key2,(CompactBuffer(val2a,val2b),Some(CompactBuffer(val2))))
I want the output to be:
key1
val1a+"|"+val1
val1b+"|"+val1
key2
val2a+"|"+val2
Answer 0 (score: 0):
Skip the groupByKey step on both RDDs and join pairs and staPairs directly; that gives the result you want.
For example,
val rdd1 = sc.parallelize(Seq("key1,val1a", "key1,val1b", "key2,val2a", "key2,val2b"))
val rdd2 = sc.parallelize(Seq("key1,val1", "key2,val2"))
// split each line into a (key, value) pair; no groupByKey needed before the join
val pairs = rdd1.map(_.split(",")).map(x => (x(0), x(1)))
val starPairs = rdd2.map(_.split(",")).map(x => (x(0), x(1)))
val res = pairs.join(starPairs)
res.foreach(println)
(key1,(val1a,val1))
(key1,(val1b,val1))
(key2,(val2a,val2))
(key2,(val2b,val2))
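To get the exact layout from the question (each key on its own line, followed by each left value joined to the right value with "|"), one option is to group the collected result on the driver. A minimal sketch, assuming res from the example above and a result small enough to collect:
res.collect().groupBy(_._1).foreach { case (key, rows) =>
  println(key)                    // e.g. key1
  rows.foreach { case (_, (v1, v2)) =>
    println(v1 + "|" + v2)        // e.g. val1a|val1
  }
}
Grouping with groupBy on the driver after collect avoids a second Spark shuffle; it is only suitable when the joined result fits in driver memory.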