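I have a Mesos setup with 3 masters and 5 slaves. The servers communicate normally, a master gets elected, and the slaves connect without trouble. But any slave that is idle and has not first had an application running on it gets a "health check failed" on the master (the slave itself does not complain about anything or lose contact, as far as I can tell). After a while the master then complains "Status update from unknown slave" and terminates the slave. This happens on all idle slaves, while those with running processes keep working without problems.
Does anyone know how to solve this?
Attached is an "excerpt" of the slave's log. I tried to clean it up a bit.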
I0225 18:02:14.077440 9029 slave.cpp:3053] Current usage 60.93%. Max allowed age: 2.035008507120139days
I0225 18:02:28.615249 9025 slave.cpp:2088] Handling status update TASK_KILLED (UUID: id) for task develop.id of framework fwid from executor(1)@ip1:45193
W0225 18:02:28.615352 9025 slave.cpp:2121] Could not find the executor for status update TASK_KILLED (UUID: id) for task develop.id of framework fwid
I0225 18:02:28.615947 9031 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: id) for task develop.id of framework fwid
I0225 18:02:28.616165 9031 status_update_manager.cpp:373] Forwarding status update TASK_KILLED (UUID: id) for task develop.id of framework fwid to master@ip2:5050
I0225 18:02:28.616334 9031 slave.cpp:2252] Sending acknowledgement for status update TASK_KILLED (UUID: id) for task develop.id of framework fwid to executor(1)@ip1:45193
I0225 18:02:28.618074 9025 slave.cpp:508] Slave asked to shut down by master@ip2:5050 because 'Status update from unknown slave'
I0225 18:02:28.618239 9025 slave.cpp:1406] Asked to shut down framework fwid by master@ip2:5050
I0225 18:02:28.618273 9025 slave.cpp:1431] Shutting down framework fwid
I0225 18:02:28.618387 9025 slave.cpp:2878] Shutting down executor 'develop.id' of framework fwid
I0225 18:02:29.336168 9027 slave.cpp:2088] Handling status update TASK_KILLED (UUID: id) for task develop.id of framework fwid from executor(1)@ip1:42376
W0225 18:02:29.336278 9027 slave.cpp:2112] Ignoring status update TASK_KILLED (UUID: id) for task develop.id of framework fwid for terminating framework fwid
I0225 18:02:30.338100 9030 containerizer.cpp:997] Executor for container 'id' has exited
I0225 18:02:30.338213 9030 containerizer.cpp:882] Destroying container 'id'
I0225 18:02:30.343300 9025 slave.cpp:2596] Executor 'develop.id' of framework fwid exited with status 0
I0225 18:02:30.343474 9025 slave.cpp:2732] Cleaning up executor 'develop.id' of framework fwid
I0225 18:02:30.343935 9029 gc.cpp:56] Scheduling '/mnt/spark/mesos/slaves/S12/frameworks/fwid/executors/develop.id/runs/id' for gc 6.99999602148148days in the future
I0225 18:02:30.344023 9025 slave.cpp:2807] Cleaning up framework fwid
I0225 18:02:30.344100 9029 gc.cpp:56] Scheduling '/mnt/spark/mesos/slaves/S12/frameworks/fwid/executors/develop.id' for gc 6.9999960201037days in the future
I0225 18:02:30.344174 9029 gc.cpp:56] Scheduling '/mnt/spark/mesos/meta/slaves/S12/frameworks/fwid/executors/develop.id/runs/id' for gc 6.99999601960593days in the future
I0225 18:02:30.344216 9025 slave.cpp:466] Slave terminating
Answer 0 (score: 1)
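The "health check failed" message means the master has been unable to PING the slave (or at least has not received its PONG) for the past minute and a half. Are you having intermittent network problems? Have you tried pinging the slaves from the masters (and vice versa)? Is there a firewall issue on the slaves for port 5051 (or whichever port you use)?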
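An ICMP ping can succeed while a firewall still blocks the slave port, so it may help to test a plain TCP connection from each master to each slave as well. The following is only a minimal sketch, not part of Mesos: `can_connect` and `check_hosts` are hypothetical helper names, and 5051 is assumed to be the slave port as mentioned above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_hosts(hosts, port: int = 5051, timeout: float = 3.0) -> dict:
    """Map each host to whether its port accepted a TCP connection."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

Run on a master, `check_hosts(["slave1", "slave2"])` (with your real slave hostnames or IPs in place of these placeholders) returns a dictionary showing which slaves accept connections on the port. Since the failures are intermittent, running it repeatedly over a few minutes is more telling than a single pass. A successful connect only shows the port is reachable; it does not prove that the PING/PONG messages themselves get through.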