Question

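I have been experimenting with Spark's mapPartitionsWithIndex, and I ran into a problem when trying to return an Iterator of a tuple that itself contains an empty iterator.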

I tried several different ways of constructing the inner iterator [via Iterator() and List(...).iterator], and all roads led me to this error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0 in stage 0.0 (TID 2) had a not serializable result: scala.collection.LinearSeqLike$$anon$1
Serialization stack:
        - object not serializable (class: scala.collection.LinearSeqLike$$anon$1, value: empty iterator)
        - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
        - object (class scala.Tuple2, (1,empty iterator))
        - element of array (index: 0)
        - array (class [Lscala.Tuple2;, size 1)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)

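Below is my code sample. Note that it runs fine as given (it returns an empty iterator as the mapPartitionsWithIndex value), but if you use the commented-out line in the mapPartitionsWithIndex call instead, you get the error above.

If anyone has a suggestion for how to make this work, I would be much obliged.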

import org.apache.spark.{Partition, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ANonWorkingExample extends App {
  val sparkConf = new SparkConf().setAppName("continuous").setMaster("local[*]")
  val sc = new SparkContext(sparkConf)
  val parallel: RDD[Int] = sc.parallelize(1 to 9)
  val parts: Array[Partition] = parallel.partitions

  val partRDD: RDD[(Int, Iterator[Int])] =
    parallel.coalesce(3).
      mapPartitionsWithIndex {
        (partitionIndex: Int, inputiterator: Iterator[Int]) =>
          val mappedInput: Iterator[Int] = inputiterator.map(_ + 1)
          // Iterator((partitionIndex, mappedInput)) // FAILS
          Iterator()   // no exception.. but not really what i want.

      }

  val data = partRDD.collect
  println("data:" + data.toList);
}

Answer 1

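I'm not sure what you are trying to achieve, and I am a novice compared to some of the experts here.

I'll offer a few thoughts and comments that may help you see how to approach this:

You seem to be getting the partitions explicitly and then calling mapPartitions; that is a first for me.
Working with RDDs inside mapPartitions and the like in Spark Scala won't fly; this is about iterables, and I think you need to drop down to plain Scala.
The serialization error came from doing List[Int].

Here is an example that shows each partition index together with the values that belong to it.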

import org.apache.spark.{Partition, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
// from your stuff, left in

val parallel: RDD[Int] = sc.parallelize(1 to 9, 4)
val mapped = parallel.mapPartitionsWithIndex { (index, iterator) =>
  println("Called in Partition -> " + index)
  val myList = iterator.toList
  // Pair every value with its partition index, then regroup into (index, values)
  myList.map(x => (index, x)).groupBy(_._1).mapValues(_.map(_._2)).toList.iterator
}
mapped.collect()

This returns the following, which I think is close to what you want:

res38: Array[(Int, List[Int])] = Array((0,List(1, 2)), (1,List(3, 4)), (2,List(5, 6)), (3,List(7, 8, 9)))
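
The chunk boundaries in that output are consistent with cutting the 9 elements at positions i * 9 / 4. The sketch below reproduces that arithmetic in plain Scala, without Spark. It reflects my understanding of how parallelize slices its input rather than anything guaranteed by the API, and SliceSketch and slice are hypothetical names used only for illustration.

```scala
// Hypothetical helper, not part of the Spark API: an assumption about how
// sc.parallelize cuts a sequence into numSlices contiguous chunks.
object SliceSketch {
  def slice[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    val n = data.length.toLong
    (0 until numSlices).map { i =>
      // Chunk i covers the half-open index range [start, end)
      val start = ((i * n) / numSlices).toInt
      val end = (((i + 1) * n) / numSlices).toInt
      data.slice(start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    // Print each chunk next to the partition index it would land in
    slice(1 to 9, 4).zipWithIndex.foreach { case (chunk, index) =>
      println(s"$index -> ${chunk.toList}")
    }
  }
}
```

With 9 elements and 4 slices the cut points are 0, 2, 4, 6 and 9, so the last chunk gets the extra element.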

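A final tip: the documentation is not that easy to follow, and you cannot get everything you need from the word count examples!

So, I hope this helps.

I think this may put you on the right path. I could not see it myself, but maybe you can now see the wood for the trees.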

Answer 2

So, the dumb thing I was doing was trying to return a data structure that is not serializable: an Iterator, as the stack trace I got clearly indicates.
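
This can be checked without Spark. The sketch below assumes plain Java serialization, which is what Spark's default serializer uses, and tries to write a tuple holding a List and a tuple holding an Iterator. SerializationCheck and isJavaSerializable are hypothetical names for illustration only.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical helper, not part of Spark: reports whether a value survives
// plain Java serialization.
object SerializationCheck {
  def isJavaSerializable(value: AnyRef): Boolean = {
    // The stream is backed by memory only, so there is nothing to close
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    }
  }

  def main(args: Array[String]): Unit = {
    // A tuple holding a List can be written out
    println(isJavaSerializable((1, List(2, 3))))                // true
    // A tuple holding an Iterator cannot, even when the iterator is empty
    println(isJavaSerializable((1, List.empty[Int].iterator)))  // false
  }
}
```

The failure does not depend on the iterator being empty: an iterator over a non-empty List fails the same check.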

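The solution is to not use an iterator. Instead, use a collection such as Seq or List. The sample program below illustrates the right way to do what I was trying to do.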

import org.apache.spark.{Partition, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object AWorkingExample extends App {
  val sparkConf = new SparkConf().setAppName("batman").setMaster("local[*]")
  val sc = new SparkContext(sparkConf)
  val parallel: RDD[Int] = sc.parallelize(1 to 9)
  val parts: Array[Partition] = parallel.partitions

  val partRDD: RDD[(Int, List[Int])] =
    parallel.coalesce(3).
      mapPartitionsWithIndex {
        (partitionIndex: Int, inputiterator: Iterator[Int]) =>
          val mappedInput: Iterator[Int] = inputiterator.map(_ + 1)
          Iterator((partitionIndex, mappedInput.toList)) // Note the .toList() call -- that makes it work
      }

  val data = partRDD.collect
  println("data:" + data.toList);
}

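Incidentally, what I was originally trying to do was see concretely which chunks of the data I parallelized into the RDD were assigned to which partition. This is the output you get when you run the program: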

data:List((0,List(2, 3)), (1,List(4, 5, 6)), (2,List(7, 8, 9, 10)))

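Interestingly, the data could have been distributed in an optimally balanced way, but it was not. That is not the point of the question, but I thought it was worth noting.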

Wondering why an empty inner iterator causes mapPartitionsWithIndex