Running threads in Spark DataFrame foreachPartition()

Time: 2017-01-16 10:18:13

Tags: multithreading scala apache-spark apache-spark-sql

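I use multiple threads inside foreachPartition(), and it works well for me except when the underlying iterator is a TungstenAggregationIterator. Here is a minimal code snippet that reproduces the problem: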

    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, Future}

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    object Reproduce extends App {

      val sc = new SparkContext("local", "reproduce")
      val sqlContext = new SQLContext(sc)

      import sqlContext.implicits._

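      // groupBy(...).count() makes the partition iterator a TungstenAggregationIterator
      val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()

      df.foreachPartition { iterator =>
        // Consume the iterator on a thread from the global ExecutionContext,
        // not on the Spark task thread that handed it out
        val f = Future(iterator.toVector)
        Await.result(f, Duration.Inf)
      }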
    }

When I run it, I get:

    java.lang.NullPointerException
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)

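I believe I actually understand why this happens: TungstenAggregationIterator uses a ThreadLocal variable, which returns null when read from any thread other than the one that originally obtained the iterator from Spark. From inspecting the code, this does not appear to differ between recent Spark versions.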
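The ThreadLocal behaviour described above can be shown without Spark. The following is only a minimal sketch of the general mechanism, not Spark's actual code: `ThreadLocalDemo`, `context`, and the string `"task-context"` are hypothetical names made up for illustration. A value set on one thread is visible there, while a second thread reading the same ThreadLocal sees null.

```scala
object ThreadLocalDemo {
  // Hypothetical stand-in for per-task state kept in a ThreadLocal.
  private val context = new ThreadLocal[String]

  // Returns (value read on the calling thread, value read on a second thread).
  def observe(): (String, String) = {
    context.set("task-context")
    val sameThread = context.get()

    var otherThread: String = "unset"
    val t = new Thread(new Runnable {
      // This thread never called set(), so get() yields the default (null).
      def run(): Unit = { otherThread = context.get() }
    })
    t.start()
    t.join() // join() makes the write to otherThread visible here

    context.remove()
    (sameThread, otherThread)
  }

  def main(args: Array[String]): Unit = {
    val (same, other) = observe()
    println(s"same thread: $same, other thread: $other")
  }
}
```

Dereferencing that second result without a null check would produce a NullPointerException of the kind shown in the stack trace.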

However, this limitation applies only to TungstenAggregationIterator and, as far as I can tell, it is not documented.

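Is there any way to work around this limitation of TungstenAggregationIterator? Is there any relevant documentation? I have a workaround, but it is very hacky and needlessly hurts runtime performance.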
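For reference, one possible shape of such a workaround is sketched below. This is an assumption, not necessarily the workaround mentioned above, and `DrainThenProcess` and `drainThenProcess` are hypothetical names. The idea is to drain the iterator on the thread that received it, so that `next()` is never called from another thread, and only then hand the buffered rows to a Future.

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

object DrainThenProcess {
  // Hypothetical helper: consumes the iterator on the calling thread, then
  // runs `work` on the buffered rows in a Future and waits for the result.
  def drainThenProcess[T, R](iterator: Iterator[T])(work: Vector[T] => R): R = {
    // hasNext/next() are only ever called here, on the caller's thread.
    val rows = iterator.toVector
    // Only the already-materialized Vector crosses the thread boundary.
    val f = Future(work(rows))
    Await.result(f, Duration.Inf)
  }
}
```

Inside the `foreachPartition` closure from the question, this would be called as `DrainThenProcess.drainThenProcess(iterator)(_.size)`. The trade-off is that the whole partition is buffered in memory before any background work starts, which may be exactly the kind of runtime cost the question hopes to avoid.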

0 answers:

No answers