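I am using multiple threads inside foreachPartition(), and it works well for me except when the underlying iterator is a TungstenAggregationIterator. Here is a minimal snippet that reproduces the problem: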
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object Reproduce extends App {
  val sc = new SparkContext("local", "reproduce")
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // groupBy(...).count() puts a TungstenAggregationIterator underneath each partition
  val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()

  df.foreachPartition { iterator =>
    // Consume the iterator on a thread from the global pool, not the Spark task thread
    val f = Future(iterator.toVector)
    Await.result(f, Duration.Inf)
  }
}
When I run it, I get:
java.lang.NullPointerException
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
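I believe I understand why this happens: TungstenAggregationIterator uses a ThreadLocal variable, which returns null when it is read from any thread other than the original one that obtained the iterator from Spark. From inspecting the code, this does not appear to differ between recent Spark versions. However, this limitation is specific to TungstenAggregationIterator and, as far as I know, it is not documented.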
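To show the mechanism I suspect, here is a minimal standalone sketch that uses a plain java.lang.ThreadLocal. It is not Spark's code: the object ThreadLocalDemo, the field name context, and the value "task-context" are all hypothetical stand-ins. It only illustrates that a value set on one thread is not visible from another.

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object ThreadLocalDemo {
  // Hypothetical stand-in for per-task state; this is NOT Spark's actual field.
  private val context = new ThreadLocal[String]()

  // Sets the value on the calling thread and reads it back on that same thread.
  def readOnOwnerThread(): String = {
    context.set("task-context")
    context.get()
  }

  // Sets the value on the calling thread, then reads it from a different thread.
  // A ThreadLocal without an initial value yields null on threads that never set it.
  def readOnOtherThread(): String = {
    context.set("task-context")
    val pool = Executors.newSingleThreadExecutor()
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try Await.result(Future(context.get()), Duration.Inf)
    finally pool.shutdown()
  }

  def main(args: Array[String]): Unit = {
    println(readOnOwnerThread()) // prints "task-context"
    println(readOnOtherThread()) // prints "null"
  }
}
```

If the iterator dereferences such a value without a null check when next() is called from a pool thread, that would be consistent with the NullPointerException in the stack trace above, although I have not confirmed which field is involved.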
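Is there a way to work around this limitation of TungstenAggregationIterator? Is there any relevant documentation? I have a workaround, but it is very hacky and needlessly degrades runtime performance.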