I use a Future to perform a blocking operation on an RDD, like so:
dStreams.foreach(_.foreachRDD { rdd =>
  Future { writeRDD(rdd) }
})

Sometimes I get this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task creation failed: org.apache.spark.SparkException: Attempted to use BlockRDD[820] at actorStream at Tests.scala:149 after its blocks have been removed!

It seems that Spark has trouble figuring out when this RDD should be deleted.

Why is this happening, and what is the solution?

Update

I think the RDDs may be garbage-collected before they are used. The only working solution I have found so far involves setting

conf.set("spark.streaming.unpersist", "false")

and calling unpersist() manually.

Here is the full stack trace, in case this is a bug:
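The workaround described in the update can be sketched roughly as below. This is a minimal sketch, not the asker's actual code: writeRDD and the stream setup are assumptions, and the core loop is shown as comments because it needs a live DStream to run.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Keep Spark Streaming from eagerly unpersisting generated micro-batch RDDs.
val conf = new SparkConf()
  .setAppName("manual-unpersist-sketch")
  .set("spark.streaming.unpersist", "false")
val ssc = new StreamingContext(conf, Seconds(1))

// Hypothetical stream and writer (writeRDD is the asker's blocking writer):
// stream.foreachRDD { rdd =>
//   Future {
//     writeRDD(rdd)    // blocking write
//     rdd.unpersist()  // release the blocks only after the write finishes
//   }
// }
```

With eager unpersisting disabled, the RDD's blocks survive until the Future explicitly releases them, at the cost of holding memory longer.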
Answer 0 (score: 2)
I think the problem is that, by the time the code inside writeRDD(rdd) runs (because it is inside a Future), the rdd (i.e. the micro-batch RDD) has already been reclaimed by Apache Spark's memory management or by the BlockManager.
Hence this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task creation failed: org.apache.spark.SparkException: Attempted to use BlockRDD[820] at actorStream at Tests.scala:149 after its blocks have been removed!
You can work around this by collecting the micro-batch data first and then passing the resulting collection to your write function, like so:
dStreams.foreach(_.foreachRDD { rdd =>
  val coll = rdd.collect()
  Future { writeCollection(coll) }
})
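The race the answer describes can be reproduced without Spark at all. The sketch below uses a toy stand-in class (ToyRDD, purely hypothetical, not a Spark API) whose backing data can be dropped out from under a reader, to show why handing the Future already-collected data avoids the failure:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Toy stand-in for a BlockRDD: its blocks can be removed while a reader waits.
class ToyRDD(data: Seq[Int]) {
  @volatile private var blocks: Option[Seq[Int]] = Some(data)
  def collect(): Seq[Int] = blocks.getOrElse(
    throw new IllegalStateException("blocks have been removed"))
  def unpersist(): Unit = blocks = None
}

// Race: the Future touches the RDD only after the driver has dropped it.
def lazyReadFails(): Boolean = {
  val rdd = new ToyRDD(Seq(1, 2, 3))
  val f = Future { Thread.sleep(200); rdd.collect().sum }
  rdd.unpersist() // the framework reclaims the blocks before the Future reads
  Await.ready(f, 2.seconds).value.get.isFailure
}

// Fix: materialize the data first, then hand the plain collection to the Future.
def collectedReadSucceeds(): Int = {
  val rdd = new ToyRDD(Seq(1, 2, 3))
  val coll = rdd.collect() // snapshot taken while the blocks still exist
  val f = Future { Thread.sleep(200); coll.sum }
  rdd.unpersist()
  Await.result(f, 2.seconds)
}

println(s"lazy read failed: ${lazyReadFails()}")      // true
println(s"collected sum: ${collectedReadSucceeds()}") // 6
```

The same design choice applies to the real code: rdd.collect() pins a snapshot on the driver, so the Future no longer depends on blocks the streaming machinery may reclaim.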