RDD[x: Vector[String]] to RDD[x: Vector[String] + Iterator: Vector[String]]

Date: 2016-11-15 18:20:15

Tags: scala apache-spark rdd

DocsRDD:

RDD[Vector[String]]

Contents of DocsRDD:

Vector(Doc1.txt, Doc2.txt, Doc5.txt)
Vector(Doc4.txt, Doc3.txt)
Vector(Doc6.txt, Doc9.txt)

All I want is every pair of documents; e.g. from DocsRDD I want:

AllDualDocsRDD:

Vector(Doc1.txt, Doc2.txt)
Vector(Doc1.txt, Doc5.txt)
Vector(Doc2.txt, Doc5.txt)
Vector(Doc4.txt, Doc3.txt)
Vector(Doc6.txt, Doc9.txt)

Here is a sample of my code (I am new to Spark and Scala):

val AllDualDocsRDD = DocsRDD.map(e => if (e.size > 2) {
                            val V_iter = (1 to e.size).flatMap(e.combinations).filter(_.size == 2).toVector
                            V_iter.foreach(println)
                            // Here I cannot return V_iter : scala.Vector[Vector[String]]
                        }
                        else e)

But it seems I have hit a wall! Does anyone know how I can do this?

2 Answers:

Answer 0 (score: 0):

Try:

sc.parallelize(
  Seq(Vector("Doc1.txt", "Doc2.txt", "Doc5.txt"))
).flatMap(v => v.combinations(Math.min(v.size, 2)))
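
Applied to the full DocsRDD from the question rather than a single hard-coded vector, the same expression yields every pair. A minimal sketch follows; the collect() call is only added here to inspect the result, and Math.min(v.size, 2) keeps vectors with fewer than two elements as-is instead of dropping them:

// Sketch: DocsRDD is the RDD[Vector[String]] from the question.
val AllDualDocsRDD = DocsRDD.flatMap(v => v.combinations(Math.min(v.size, 2)))
AllDualDocsRDD.collect().foreach(println)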

Answer 1 (score: 0):

How about a straight flatMap on the RDD:

val AllDualDocsRDD = DocsRDD.flatMap { x =>
  if (x.size > 2) x.combinations(2).toSeq
  else Seq(x)
}

This did the trick.
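
For reference, a self-contained sketch of this flatMap approach, assuming a local SparkContext named sc; the collected output matches the AllDualDocsRDD listing in the question:

// Build the question's DocsRDD locally (assumes a SparkContext `sc`).
val DocsRDD = sc.parallelize(Seq(
  Vector("Doc1.txt", "Doc2.txt", "Doc5.txt"),
  Vector("Doc4.txt", "Doc3.txt"),
  Vector("Doc6.txt", "Doc9.txt")
))

// Expand each vector with more than two documents into all 2-element combinations.
val AllDualDocsRDD = DocsRDD.flatMap { x =>
  if (x.size > 2) x.combinations(2).toSeq
  else Seq(x)
}

AllDualDocsRDD.collect().foreach(println)
// Vector(Doc1.txt, Doc2.txt)
// Vector(Doc1.txt, Doc5.txt)
// Vector(Doc2.txt, Doc5.txt)
// Vector(Doc4.txt, Doc3.txt)
// Vector(Doc6.txt, Doc9.txt)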