How to avoid the "Invalid checkpoint directory" error in Apache Spark?

Asked: 2015-04-17 22:35:36

Tags: scala apache-spark directed-acyclic-graphs spark-streaming

I use Amazon EMR + S3 as my Spark cluster infrastructure. I run a job with periodic checkpointing (it has a long dependency tree, so the lineage has to be truncated by checkpointing; each checkpoint has 320 partitions). The job stops midway with the following exception:

(On driver)
org.apache.spark.SparkException: Invalid checkpoint directory: s3n://spooky-checkpoint/9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198
    at org.apache.spark.rdd.CheckpointRDD.getPartitions(CheckpointRDD.scala:54)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
...
(On Executor)
15/04/17 22:00:14 WARN StorageService: Encountered 4 Internal Server error(s), will retry in 800ms
15/04/17 22:00:15 WARN RestStorageService: Retrying request following error response: PUT '/9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198/part-00025' -- ResponseCode: 500, ResponseStatus: Internal Server Error
...
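For context, the checkpointing pattern in question looks roughly like the sketch below. Only the 320-partition count and the S3 checkpoint directory come from the post and the error above; the job name, data, and transformation chain are placeholders of my own:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointedJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpointed-job"))
    sc.setCheckpointDir("s3n://spooky-checkpoint")   // checkpoint directory on S3 (from the error above)

    // Placeholder data: 320 partitions, matching the job described above.
    var rdd = sc.parallelize(0 until 320000, numSlices = 320)

    // A long chain of transformations; checkpoint periodically so the lineage
    // (dependency tree) is truncated instead of growing without bound.
    for (i <- 1 to 100) {
      rdd = rdd.map(_ + 1)
      if (i % 10 == 0) {
        rdd.persist()      // avoid recomputing the RDD when the checkpoint is written
        rdd.checkpoint()   // mark the RDD for checkpointing
        rdd.count()        // an action runs the job and writes the checkpoint files
      }
    }
    sc.stop()
  }
}
```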

After manually inspecting the checkpoint files, I found that /9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198/part-00025 is indeed missing on S3. So my questions are: if the file went missing (possibly due to an AWS failure), why wasn't that detected immediately during the checkpointing step, where the write could simply have been retried, instead of surfacing later as an unrecoverable error saying the dependency tree is already lost? And how can I prevent this from happening again?
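I am not aware of a built-in option that makes the checkpoint writer re-verify every part file on S3, but one partial workaround (my own sketch, not from the post) is to force the checkpoint with an action and then list the checkpoint directory yourself, failing fast if part files are missing. Note that this only surfaces the problem earlier; once the lineage has been truncated, the missing partition cannot be recomputed from this point:

```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object CheckpointCheck {
  // Checkpoint `rdd`, then poll S3 until one part file per partition is visible
  // (or the retry budget is exhausted). This only detects a missing file early;
  // it cannot recover the data, since the lineage is already truncated.
  def checkpointAndVerify[T](sc: SparkContext, rdd: RDD[T], maxRetries: Int = 5): Unit = {
    rdd.checkpoint()
    rdd.count()                                  // action materializes the RDD and writes the checkpoint files

    val dir = rdd.getCheckpointFile.getOrElse(sys.error("RDD was not checkpointed"))
    val fs  = FileSystem.get(new URI(dir), sc.hadoopConfiguration)
    def visibleParts: Int =
      fs.listStatus(new Path(dir)).count(_.getPath.getName.startsWith("part-"))

    var attempt = 0
    while (visibleParts < rdd.partitions.length && attempt < maxRetries) {
      attempt += 1
      Thread.sleep(1000L * attempt)              // S3 listings can lag briefly; back off and re-check
    }
    if (visibleParts < rdd.partitions.length)
      sys.error(s"Checkpoint at $dir is incomplete; failing fast instead of hitting 'Invalid checkpoint directory' later")
  }
}
```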

0 Answers:

No answers yet.