Question

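I am using multiple threads inside foreachPartition(), which works well for me except when the underlying iterator is a TungstenAggregationIterator. Here is a minimal code snippet that reproduces the problem: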

    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, Future}

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    object Reproduce extends App {

      val sc = new SparkContext("local", "reproduce")
      val sqlContext = new SQLContext(sc)

      import sqlContext.implicits._

      val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()

      df.foreachPartition { iterator =>
        val f = Future(iterator.toVector)
        Await.result(f, Duration.Inf)
      }
    }

When I run it, I get:

    java.lang.NullPointerException
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)

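I believe I actually understand why this happens: TungstenAggregationIterator uses a ThreadLocal variable, which returns null when it is read from any thread other than the original thread that obtained the iterator from Spark. From inspecting the code, this does not appear to differ between recent Spark versions.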
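To illustrate the mechanism I suspect, here is a minimal sketch that does not use Spark at all. It only shows the general ThreadLocal behaviour: a value set on one thread is not visible from a thread in another pool. The names `ThreadLocalSketch`, `FakeTaskState` and `readOnOtherThread` are hypothetical and made up for this illustration; they are not Spark APIs, and I am assuming (not asserting) that Spark's per-task state behaves analogously.

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object ThreadLocalSketch {
  // Hypothetical stand-in for per-task state that a framework keeps in a ThreadLocal.
  final case class FakeTaskState(partitionId: Int)

  // No initial value is defined, so get() returns null on any thread that never called set().
  private val current = new ThreadLocal[FakeTaskState]

  // Reads the ThreadLocal from a freshly created worker thread and returns what it sees.
  def readOnOtherThread(): FakeTaskState = {
    val pool = Executors.newSingleThreadExecutor()
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try Await.result(Future(current.get()), Duration.Inf)
    finally pool.shutdown()
  }

  def main(args: Array[String]): Unit = {
    // The "task thread" registers its state, as a framework would before running user code.
    current.set(FakeTaskState(0))

    val sameThread = current.get()        // the value set above
    val otherThread = readOnOtherThread() // null: the worker thread never called set()

    println(s"same thread:  $sameThread")
    println(s"other thread: $otherThread")
    // Calling a method on otherThread here would throw a NullPointerException.
  }
}
```

If this is indeed what happens inside the iterator, then any call to next() made from a Future running on a different thread would hit the null, which matches the stack trace above.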

However, this limitation is specific to TungstenAggregationIterator and, as far as I can tell, it is not documented.

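Is there a way to work around this limitation of TungstenAggregationIterator? Is there any relevant documentation? I have a workaround, but it is very hacky and needlessly reduces runtime performance.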

Running threads in Spark DataFrame foreachPartition()

0 answers: