Spark RDD Block removed before use

Date: 2015-10-12 09:33:32

Tags: scala apache-spark spark-streaming

I am using a Future to perform a blocking operation on an RDD, like this:

dStreams.foreach(_.foreachRDD { rdd =>

  Future{ writeRDD(rdd) }

})

Sometimes I get this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task creation failed: org.apache.spark.SparkException: Attempted to use BlockRDD[820] at actorStream at Tests.scala:149 after its blocks have been removed!

It seems like Spark has a hard time knowing when this RDD should be removed.

Why is this happening, and what is the solution?

Update

I think the RDDs may be garbage-collected before they are used. The only working solution I have found so far involves setting

conf.set("spark.streaming.unpersist", "false")

conf.set("spark.streaming.unpersist", "false") - 手动。

Full stack trace, in case this is a bug:

org.apache.spark.SparkException: Job aborted due to stage failure: Task creation failed: org.apache.spark.SparkException: Attempted to use BlockRDD[820] at actorStream at Tests.scala:149 after its blocks have been removed!

1 Answer:

Answer 0 (score: 2)

I think the problem is that, by the time the code inside writeRDD(rdd) executes (because it runs inside a Future), the rdd (the micro-batch RDD) has already been reclaimed by Apache Spark's memory management or by the BlockManager.

Hence this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task creation failed: org.apache.spark.SparkException: Attempted to use BlockRDD[820] at actorStream at Tests.scala:149 after its blocks have been removed!

You can work around this by collecting the micro-batch eagerly and passing the result on to a write function that takes plain data (writeCollection below), like this:

dStreams.foreach(_.foreachRDD { rdd =>

  val coll = rdd.collect()           // materialize the micro-batch on the driver before its blocks are removed
  Future { writeCollection(coll) }   // the Future now closes over plain local data, not the RDD

})
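
Note the trade-off in this fix: rdd.collect() runs synchronously inside foreachRDD, so the micro-batch is materialized on the driver before Spark Streaming can drop its blocks, and the Future then closes over a plain Scala collection with no dependency on the BlockRDD. For large micro-batches, though, collect() can put significant memory pressure on the driver.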