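I'm using Amazon EMR + S3 as my Spark cluster infrastructure. I'm running a job with periodic checkpointing (it has a long dependency tree, so truncating it by checkpointing is mandatory; each checkpoint has 320 partitions). The job stops halfway with the following exceptions: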
(On driver)
org.apache.spark.SparkException: Invalid checkpoint directory: s3n://spooky-checkpoint/9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198
at org.apache.spark.rdd.CheckpointRDD.getPartitions(CheckpointRDD.scala:54)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
...
(On Executor)
15/04/17 22:00:14 WARN StorageService: Encountered 4 Internal Server error(s), will retry in 800ms
15/04/17 22:00:15 WARN RestStorageService: Retrying request following error response: PUT '/9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198/part-00025' -- ResponseCode: 500, ResponseStatus: Internal Server Error
...
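After manually inspecting the checkpointed files, I found that /9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198/part-00025 is indeed missing on S3. So my question is: if the file is missing (perhaps due to an AWS malfunction), why isn't that detected immediately during the checkpointing process (so the write can be retried), instead of throwing an unrecoverable error saying the dependency tree has already been lost? And how can I avoid this happening again?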
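
To make the "detect it immediately" part concrete, here is a minimal sketch of the kind of check I have in mind right after a checkpoint is written. This is not Spark's behavior or API: `missing_parts` is a hypothetical helper written in plain Python for illustration. It assumes the keys under the checkpoint directory have already been listed with some S3 client, and that part files follow the `part-NNNNN` naming seen in the executor log.

```python
import re

# Matches the trailing "part-NNNNN" component of a checkpoint file key.
PART_RE = re.compile(r"part-(\d+)$")


def missing_parts(keys, num_partitions):
    """Return the sorted partition indices that have no part file in `keys`.

    `keys` is an iterable of object keys already listed from the checkpoint
    directory (how they are listed is outside this sketch).
    """
    present = set()
    for key in keys:
        match = PART_RE.search(key)
        if match:
            present.add(int(match.group(1)))
    return [i for i in range(num_partitions) if i not in present]


# Simulate the situation from the logs: 320 partitions, part-00025 never written.
prefix = "/9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198/"
keys = [prefix + "part-%05d" % i for i in range(320) if i != 25]
print(missing_parts(keys, 320))  # [25]
```

If a check like this returned a non-empty list, the checkpoint write could be retried while the parent RDD lineage is still available, rather than failing later when the checkpoint is read back.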