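I have a cluster with 5 mesos-master machines and a quorum of 3. All 5 machines went down in a single crash, and I am now trying to get a new master to stabilize. I see the following: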
F0314 17:21:22.007699 8233 master.cpp:1176] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
@ 0x7f80faa2de1d google::LogMessage::Fail()
@ 0x7f80faa2fd35 google::LogMessage::SendToLog()
@ 0x7f80faa2da3c google::LogMessage::Flush()
@ 0x7f80faa305a9 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f80fa4cded7 mesos::internal::master::fail()
@ 0x7f80fa4fc89b _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
@ 0x7f80fa430415 process::internal::run<>()
@ 0x7f80fa43049b process::Future<>::fail()
@ 0x7f80fa52f22e process::internal::thenf<>()
@ 0x7f80fa5737f5 _ZN7process8internal3runISt8functionIFvRKNS_6FutureIN5mesos8internal8RegistryEEEEEJRS7_EEEvRKSt6vectorIT_SaISE_EEDpOT0_
@ 0x7f80fa57388d process::Future<>::fail()
@ 0x7f80fa430415 process::internal::run<>()
@ 0x7f80fa57387b process::Future<>::fail()
@ 0x7f80fa56965c mesos::internal::master::RegistrarProcess::_recover()
@ 0x7f80fa9d88da process::ProcessManager::resume()
@ 0x7f80fa9d8b8c process::schedule()
@ 0x7f80f9b17192 start_thread
@ 0x7f80f8e1c26d (unknown)
Steps I have taken:
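Can anyone suggest what else I could clear, or what other logging I could turn on? Should the 1-minute fetch timeout be increased somehow? (I am not sure what controls this time limit.)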
Before the crash, I see the following in the log:
I0314 17:20:22.006703 8233 master.cpp:1187] Recovering from registrar
I0314 17:20:22.006728 8234 registrar.cpp:313] Recovering registrar
I0314 17:20:22.008124 8228 group.cpp:659] Trying to get '/mesos.cluster1/log_replicas/0000000040' in ZooKeeper
I0314 17:20:22.009953 8228 group.cpp:659] Trying to get '/mesos.cluster1/log_replicas/0000000041' in ZooKeeper
I0314 17:20:22.011700 8230 network.hpp:466] ZooKeeper group PIDs: { log-replica(1)@ip1:5050, log-replica(1)@ip2:5050, log-replica(1)@ip3:5050, log-replica(1)@ip4:5050 }
I0314 17:20:22.031610 8233 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0314 17:20:22.606565 8228 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0314 17:20:23.250629 8231 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0314 17:20:23.930461 8234 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0314 17:20:24.331779 8231 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0314 17:20:24.493544 8234 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0314 17:20:24.739902 8230 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0314 17:20:24.976166 8236 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
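The EMPTY line then repeats many times until the master dies as shown above. I am pretty much at a loss as to how to get a master to stabilize.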
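To put numbers on the repetition, a small script along these lines could summarize a master log. This is only a minimal sketch that assumes the exact message wording quoted above; the function name `summarize_master_log`, the dictionary keys, and the `SAMPLE` excerpt are my own illustration and not part of Mesos. It counts the EMPTY-status recover requests and the recover-protocol retries, and extracts the replica PIDs from the ZooKeeper group membership line (the excerpt above lists four).

```python
import re

# Patterns copied from the log lines quoted in this post. The function name and
# the summary keys are hypothetical helpers, not anything provided by Mesos.
EMPTY_RE = re.compile(r"Replica in EMPTY status received a broadcasted recover request")
RETRY_RE = re.compile(r"Unable to finish the recover protocol in \d+secs, retrying")
PIDS_RE = re.compile(r"ZooKeeper group PIDs: \{(.*)\}")

# A shortened excerpt of the log above, used only to demonstrate the function.
SAMPLE = """\
I0314 17:20:22.011700 8230 network.hpp:466] ZooKeeper group PIDs: { log-replica(1)@ip1:5050, log-replica(1)@ip2:5050, log-replica(1)@ip3:5050, log-replica(1)@ip4:5050 }
I0314 17:20:22.031610 8233 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0314 17:20:22.606565 8228 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0314 17:20:24.976166 8236 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
"""


def summarize_master_log(text):
    """Count recovery-related messages in a mesos-master log excerpt."""
    summary = {"empty_requests": 0, "recover_retries": 0, "replica_pids": []}
    for line in text.splitlines():
        if EMPTY_RE.search(line):
            summary["empty_requests"] += 1
        elif RETRY_RE.search(line):
            summary["recover_retries"] += 1
        else:
            match = PIDS_RE.search(line)
            if match:
                # Keep the most recent membership list seen in the log.
                summary["replica_pids"] = [
                    pid.strip() for pid in match.group(1).split(",") if pid.strip()
                ]
    return summary


if __name__ == "__main__":
    print(summarize_master_log(SAMPLE))
```

Running the same function over the full log of each master would show whether every replica reports EMPTY status and how many replicas each master actually sees in the ZooKeeper group.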
Finally, here is my Mesos version (I cannot upgrade at the moment):
mesos-master --version
mesos 0.22.2