I am using AWS emr-5.0.0 to run a small cluster that consists of the following nodes:
All of them are x3.xlarge machines.
I run a Python Spark application with two stages.
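My real job is more involved, but this is a minimal sketch of its shape (the input/output paths and names are placeholders, not my actual application); the shuffle for reduceByKey is what splits it into two stages:

```python
from pyspark import SparkContext

sc = SparkContext(appName="two-stage-example")

# Stage 1: read input and map each word to a (word, 1) pair.
# Stage 2: reduceByKey forces a shuffle, which starts the second stage.
counts = (sc.textFile("s3://my-bucket/input/")       # placeholder path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("s3://my-bucket/output/")      # placeholder path
```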
The problem is that when I manually terminate one of the TASK instances (or it is terminated due to a spot price change), the entire Spark job fails.
I would expect Spark to simply rerun the lost tasks on the remaining nodes. Why does this not happen?
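For context, my understanding is that spark.task.maxFailures (default 4) controls how many times a single task may fail before the whole job is aborted. I have not overridden it, so I would not expect losing one node to exhaust four attempts. This snippet just makes that assumption explicit; it is not something I set in my actual job:

```python
from pyspark import SparkConf, SparkContext

# spark.task.maxFailures: attempts per task before the job is aborted.
# "4" is the documented default, shown here only for clarity.
conf = SparkConf().set("spark.task.maxFailures", "4")
sc = SparkContext(conf=conf, appName="retry-config-sketch")
```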
Below is the log (master IP is 172-31-1-0, core instance IP is 172-31-1-173, and the lost node's IP is 172-31-3-81).