Spark RDD Block removed before use

Date: 2015-10-12 09:33:32

Tags: scala apache-spark spark-streaming

I am using a Future to perform a blocking operation on an RDD, like this:

dStreams.foreach(_.foreachRDD { rdd =>

  Future{ writeRDD(rdd) }

})

Sometimes I get this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task creation failed: org.apache.spark.SparkException: Attempted to use BlockRDD[820] at actorStream at Tests.scala:149 after its blocks have been removed!

It seems like Spark has a hard time knowing when this RDD should be removed.

Why is this happening, and what is the solution?

Update

I think the RDDs may be garbage-collected before they are used. The only working solution I have found so far involves setting

conf.set("spark.streaming.unpersist", "false")

conf.set("spark.streaming.unpersist", "false") - 手动。

Full stack trace, in case this is a bug:

org.apache.spark.SparkException: Job aborted due to stage failure: Task creation failed: org.apache.spark.SparkException: Attempted to use BlockRDD[820] at actorStream at Tests.scala:149 after its blocks have been removed!

1 Answer:

Answer 0 (score: 2)

I think the problem is that, by the time the code inside writeRDD(rdd) executes (because it runs inside a Future), the rdd (the micro-batch RDD) has already been reclaimed by Apache Spark's memory management or by the BlockManager.

Hence this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task creation failed: org.apache.spark.SparkException: Attempted to use BlockRDD[820] at actorStream at Tests.scala:149 after its blocks have been removed!

You can work around this by collecting the micro-batch eagerly and passing the result on to a write function that takes plain data (writeCollection below), like this:

dStreams.foreach(_.foreachRDD { rdd =>

  val coll = rdd.collect()           // materialize the micro-batch on the driver before its blocks are removed
  Future { writeCollection(coll) }   // the Future now closes over plain local data, not the RDD

})
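
Note the trade-off in this fix: rdd.collect() runs synchronously inside foreachRDD, so the micro-batch is materialized on the driver before Spark Streaming can drop its blocks, and the Future then closes over a plain Scala collection with no dependency on the BlockRDD. For large micro-batches, though, collect() can put significant memory pressure on the driver.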