Question

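I am using Amazon EMR + S3 as my Spark cluster infrastructure. When I run a job with periodic checkpointing (it has a long dependency tree, so it has to be truncated by checkpointing; each checkpoint has 320 partitions), the job stops halfway with the following exception: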

(On driver)
org.apache.spark.SparkException: Invalid checkpoint directory: s3n://spooky-checkpoint/9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198
    at org.apache.spark.rdd.CheckpointRDD.getPartitions(CheckpointRDD.scala:54)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
...
(On Executor)
15/04/17 22:00:14 WARN StorageService: Encountered 4 Internal Server error(s), will retry in 800ms
15/04/17 22:00:15 WARN RestStorageService: Retrying request following error response: PUT '/9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198/part-00025' -- ResponseCode: 500, ResponseStatus: Internal Server Error
...

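After manually inspecting the checkpoint files, I found that /9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198/part-00025 is indeed missing on S3. So my question is: if the file is missing (probably due to an AWS failure), why isn't this detected immediately during the checkpointing process (so that the write could be retried), instead of throwing an unrecoverable error saying that the dependency tree has already been lost? And how can I prevent this from happening again?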
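To illustrate the kind of check I have in mind, here is a minimal sketch in plain Python. It is not Spark's own logic. It assumes the part files are named `part-NNNNN` (as in the log above) and that a listing of the checkpoint directory has already been fetched by some other means; `missing_checkpoint_parts` and the sample listing are hypothetical names made up for this example.

```python
def missing_checkpoint_parts(listed_names, num_partitions):
    """Return the expected part-file names that are absent from a listing.

    listed_names: file names found under one rdd-* checkpoint directory
                  (assumed to be fetched elsewhere, e.g. from an S3 listing).
    num_partitions: number of partitions the checkpointed RDD should have.
    """
    present = set(listed_names)
    # Assumption: part files follow the zero-padded pattern seen in the log.
    expected = [f"part-{i:05d}" for i in range(num_partitions)]
    return [name for name in expected if name not in present]


# Hypothetical listing: 320 partitions expected, part-00025 was never written.
listing = [f"part-{i:05d}" for i in range(320) if i != 25]
missing = missing_checkpoint_parts(listing, 320)
print(missing)  # ['part-00025']
```

If a check like this ran right after the checkpoint was written, a non-empty result could trigger a retry of the write before the lineage is truncated.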

How to avoid the "Invalid checkpoint directory" error in Apache Spark?

0 answers: