Question

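I have an RDD[String], wordRDD. I also have a function that creates an RDD[String] from a string/word. I would like to create a new RDD for each string in wordRDD. Here are my attempts:

1) Failed, because Spark does not support nested RDDs: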

var newRDD = wordRDD.map( word => {
  // execute myFunction()
  (new MyClass(word)).myFunction()
})

2) Failed (possibly due to a scoping issue?):

var newRDD = sc.parallelize(new Array[String](0))
val wordArray = wordRDD.collect
for (w <- wordArray){
  newRDD = sc.union(newRDD,(new MyClass(w)).myFunction())
}

My ideal result would look like this:

// input RDD (wordRDD)
wordRDD: org.apache.spark.rdd.RDD[String] = ('apple','banana','orange'...)

// myFunction behavior
new MyClass('apple').myFunction(): RDD[String] = ('pple','aple'...'appl')

// after executing myFunction() on each word in wordRDD:
newRDD: RDD[String] = ('pple','aple',...,'anana','bnana','baana',...)

I found a related question here: Spark when union a lot of RDD throws stack overflow error, but it did not solve my problem.

Answer 1

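You cannot create an RDD inside another RDD.

However, you can rewrite your function myFunction: String => RDD[String], which generates all words obtained from the input by removing one letter, as another function modifiedFunction: String => Seq[String], so that it can be used from within an RDD. That way, it is also executed in parallel on your cluster. Once you have modifiedFunction, you can obtain the final RDD containing all the words simply by calling wordRDD.flatMap(modifiedFunction).

The key point is to use flatMap (a map followed by a flatten):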

import org.apache.spark.{SparkConf, SparkContext}

object Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)

    val input = sc.parallelize(Seq("apple", "ananas", "banana"))

    // RDD("pple", "aple", ..., "nanas", ..., "anana", "bnana", ...)
    val result = input.flatMap(modifiedFunction)
  }

  // Returns every word obtained by deleting exactly one character from `word`.
  def modifiedFunction(word: String): Seq[String] = {
    word.indices map {
      index => word.substring(0, index) + word.substring(index + 1)
    }
  }
}
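
To see why flatMap is the right tool here, it may help to compare it with map on an ordinary Scala collection, with no Spark involved. The following is only a minimal sketch: FlatMapSketch and dropOneLetter are hypothetical names introduced for illustration, and dropOneLetter is assumed to behave like modifiedFunction above.

```scala
object FlatMapSketch {
  // Hypothetical helper, assumed equivalent to modifiedFunction above:
  // removes the character at each index in turn.
  def dropOneLetter(word: String): Seq[String] =
    word.indices.map(i => word.substring(0, i) + word.substring(i + 1))

  def main(args: Array[String]): Unit = {
    val words = Seq("apple", "kiwi")

    // map alone yields one inner collection per input word: Seq[Seq[String]]
    val nested: Seq[Seq[String]] = words.map(dropOneLetter)

    // flatMap yields a single flat Seq[String]
    val flat: Seq[String] = words.flatMap(dropOneLetter)

    // flatMap gives the same result as map followed by flatten
    assert(flat == nested.flatten)
    println(flat.mkString(", "))
  }
}
```

RDD.flatMap follows the same pattern, which is why the result is a single RDD[String] rather than a nested RDD. Note that repeated letters produce duplicate entries ("aple" appears twice for "apple"); if that matters, calling distinct() on the resulting RDD should remove them.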

Answer 2

Use flatMap to get the result you need.

How do I create a collection of RDDs from an RDD?
