Why does an EMR Spark job fail when a task group node is lost?

Date: 2016-09-01 06:27:48

Tags: apache-spark pyspark yarn emr

I am using AWS emr-5.0.0 to run a small cluster that consists of the following nodes:

  • 1 MASTER - AWS on-demand instance
  • 1 CORE - AWS on-demand instance
  • 2 TASK - AWS spot instances

All of them are x3.xlarge machines.
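
For reference, the cluster layout is roughly equivalent to the following boto3 sketch (the name, region, bid price, and IAM roles are placeholders, not my actual values):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # region is a placeholder

    # Rough equivalent of the cluster described above: 1 on-demand master,
    # 1 on-demand core node, and 2 spot task nodes on emr-5.0.0 with Spark.
    emr.run_job_flow(
        Name="spark-spot-test",                        # placeholder name
        ReleaseLabel="emr-5.0.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "x3.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": "x3.xlarge", "InstanceCount": 1},
                {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
                 "BidPrice": "0.10",                    # placeholder bid
                 "InstanceType": "x3.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )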

I run a Python Spark application with two stages.
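
In shape, the application is something like the following (paths and the actual transformations are simplified placeholders); the shuffle introduced by reduceByKey is what splits the job into two stages:

    from pyspark import SparkContext

    sc = SparkContext(appName="two-stage-example")

    # Stage 1: read the input and map each line to a (key, 1) pair
    pairs = sc.textFile("s3://my-bucket/input/") \
              .map(lambda line: (line.split(",")[0], 1))

    # Stage 2 begins at the shuffle introduced by reduceByKey
    counts = pairs.reduceByKey(lambda a, b: a + b)

    counts.saveAsTextFile("s3://my-bucket/output/")
    sc.stop()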

The problem is that when I manually terminate one of the TASK instances (or it is terminated because of a spot price change), the entire Spark job fails.

I would expect Spark to simply rerun the lost tasks on the remaining nodes. Please explain why this does not happen.
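
Is there a setting I am missing that controls this? The only related knob I know of is the task-retry limit, something along these lines (whether it is even relevant to a lost node is exactly what I am unsure about):

    from pyspark import SparkConf, SparkContext

    # spark.task.maxFailures is the number of failures of any particular task
    # that Spark tolerates before aborting the whole job (the default is 4).
    conf = (SparkConf()
            .setAppName("two-stage-example")
            .set("spark.task.maxFailures", "8"))

    sc = SparkContext(conf=conf)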

Below is the log. The master IP is 172-31-1-0, the core instance IP is 172-31-1-173, and the lost node's IP is 172-31-3-81.

log file (stderr and stdout from spark-submit)

0 Answers:

No answers yet.