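We use Spark 2.2.0 with Mesos for resource management. Recently, for Spark applications submitted in cluster mode, we started seeing REASON_COMMAND_EXECUTOR_FAILED in the scheduler logs.
Since these messages started appearing, we have also noticed that jobs are frequently left queued even though Mesos has enough resources available.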
Here is the log from the Dispatcher application:
18/04/25 10:11:05 INFO MesosClusterScheduler: Reviving Offers.
18/04/25 10:24:23 INFO MesosClusterScheduler: Reviving Offers.
18/04/25 10:24:56 INFO MesosClusterScheduler: Reviving Offers.
18/04/25 10:26:25 INFO MesosClusterScheduler: Reviving Offers.
18/04/25 10:36:12 INFO MesosClusterScheduler: Reviving Offers.
18/04/25 10:37:12 INFO MesosClusterScheduler: Reviving Offers.
18/04/25 10:39:25 INFO MesosClusterScheduler: Reviving Offers.
18/04/25 11:02:45 INFO MesosClusterScheduler: Reviving Offers.
18/04/25 11:39:50 INFO MesosClusterScheduler: Reviving Offers.
18/04/25 11:43:44 INFO MesosClusterScheduler: Reviving Offers.
18/04/25 14:20:30 INFO MesosClusterScheduler: Received status update: taskId=driver-20180425110245-0052 state=TASK_RUNNING message= reason=REASON_COMMAND_EXECUTOR_FAILED
18/04/25 14:20:32 INFO MesosClusterScheduler: Received status update: taskId=driver-20180425114344-0054 state=TASK_RUNNING message= reason=REASON_COMMAND_EXECUTOR_FAILED
18/04/25 14:20:33 INFO MesosClusterScheduler: Received status update: taskId=driver-20180425113950-0053 state=TASK_RUNNING message= reason=REASON_COMMAND_EXECUTOR_FAILED
18/04/25 14:20:36 INFO MesosClusterScheduler: Received status update: taskId=driver-20180425102625-0048 state=TASK_RUNNING message= reason=REASON_COMMAND_EXECUTOR_FAILED
18/04/25 14:20:36 INFO MesosClusterScheduler: Received status update: taskId=driver-20180425103612-0049 state=TASK_RUNNING message= reason=REASON_COMMAND_EXECUTOR_FAILED
18/04/25 14:20:38 INFO MesosClusterScheduler: Received status update: taskId=driver-20180425103925-0051 state=TASK_RUNNING message= reason=REASON_COMMAND_EXECUTOR_FAILED
18/04/25 14:20:42 INFO MesosClusterScheduler: Received status update: taskId=driver-20180425103712-0050 state=TASK_RUNNING message= reason=REASON_COMMAND_EXECUTOR_FAILED
18/04/25 14:21:16 INFO MesosClusterScheduler: Received status update: taskId=driver-20180425110245-0052 state=TASK_FINISHED message=Container exited with status 0 reason=REASON_COMMAND_EXECUTOR_FAILED
18/04/25 14:21:19 INFO MesosClusterScheduler: Received status update: taskId=driver-20180425113950-0053 state=TASK_FINISHED message=Container exited with status 0 reason=REASON_COMMAND_EXECUTOR_FAILED
18/04/25 14:21:21 INFO MesosClusterScheduler: Received status update: taskId=driver-20180425114344-0054 state=TASK_FINISHED message=Container exited with status 0 reason=REASON_COMMAND_EXECUTOR_FAILED
18/04/25 14:21:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20180425102625-0048 state=TASK_FINISHED message=Container exited with status 0 reason=REASON_COMMAND_EXECUTOR_FAILED
18/04/25 14:21:46 INFO MesosClusterScheduler: Received status update: taskId=driver-20180425103612-0049 state=TASK_FINISHED message=Container exited with status 0 reason=REASON_COMMAND_EXECUTOR_FAILED
18/04/25 14:21:47 INFO MesosClusterScheduler: Received status update: taskId=driver-20180425103712-0050 state=TASK_FINISHED message=Container exited with status 0 reason=REASON_COMMAND_EXECUTOR_FAILED
18/04/25 14:24:12 INFO MesosClusterScheduler: Received status update: taskId=driver-20180425103925-0051 state=TASK_FINISHED message=Container exited with status 0 reason=REASON_COMMAND_EXECUTOR_FAILED
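To quantify the queueing, the log above can be scanned for each driver's TASK_RUNNING update. This is a minimal sketch that assumes the timestamp embedded in the driver ID (e.g. driver-20180425110245-0052) is the submission time. That assumption is consistent with the "Reviving Offers" lines, whose times match the IDs (11:02:45, 11:39:50, 11:43:44, ...). The helper name `queue_delays` is hypothetical, and only three lines from the log are included here.

```python
import re
from datetime import datetime

# Three TASK_RUNNING lines copied verbatim from the dispatcher log above.
LOG = """\
18/04/25 14:20:30 INFO MesosClusterScheduler: Received status update: taskId=driver-20180425110245-0052 state=TASK_RUNNING message= reason=REASON_COMMAND_EXECUTOR_FAILED
18/04/25 14:20:32 INFO MesosClusterScheduler: Received status update: taskId=driver-20180425114344-0054 state=TASK_RUNNING message= reason=REASON_COMMAND_EXECUTOR_FAILED
18/04/25 14:20:36 INFO MesosClusterScheduler: Received status update: taskId=driver-20180425102625-0048 state=TASK_RUNNING message= reason=REASON_COMMAND_EXECUTOR_FAILED
"""

# Groups: log timestamp, full driver ID, 14-digit timestamp inside the ID, task state.
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) .*taskId=(driver-(\d{14})-\d{4}) state=(\w+)"
)


def queue_delays(log_text):
    """Return {driver_id: seconds between submission and TASK_RUNNING}.

    Assumption: the 14-digit timestamp in the driver ID is the submission time.
    """
    delays = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        logged_at = datetime.strptime(m.group(1), "%y/%m/%d %H:%M:%S")
        driver_id, submitted, state = m.group(2), m.group(3), m.group(4)
        if state == "TASK_RUNNING":
            submitted_at = datetime.strptime(submitted, "%Y%m%d%H%M%S")
            delays[driver_id] = (logged_at - submitted_at).total_seconds()
    return delays


if __name__ == "__main__":
    for driver, secs in sorted(queue_delays(LOG).items()):
        print(f"{driver}: queued {secs / 3600:.2f} h")
```

If that assumption holds, every driver in this log waited between roughly 2.5 and 4 hours before reaching TASK_RUNNING. Each one then finished with "Container exited with status 0" while still carrying the same reason field.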
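I can't find the cause of this message. The jobs do complete successfully, yet this error still shows up. What worries us more is that jobs are being left in the queue.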