Mesosphere: high-availability cluster fails to elect a leader, but the logs show no errors and I cannot seem to force a leader election

Time: 2015-10-02 15:21:55

Tags: ubuntu-14.04 apache-zookeeper mesos mesosphere marathon

I have a cluster of 6 machines. The machines are:

      HOST        MEM (GB) CPU
mesos-primary-1     8       2
mesos-primary-2     8       2
mesos-primary-3     8       2
mesos-worker-1      1       1
mesos-worker-2      1       1
mesos-worker-3      1       1

My quorum size is set to 2.
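For context, a quorum of 2 is what the usual majority rule gives for three masters. A minimal sketch of that arithmetic, assuming the quorum should be a strict majority of the masters (the function name `majority_quorum` is made up for illustration):

```python
def majority_quorum(num_masters: int) -> int:
    """Return the smallest number of masters that forms a strict majority."""
    if num_masters < 1:
        raise ValueError("need at least one master")
    # Integer division: more than half of the masters must agree.
    return num_masters // 2 + 1


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, "masters -> quorum", majority_quorum(n))
```

So the value in `/etc/mesos-master/quorum` looks consistent with three masters.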

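The masters have ids 1, 2, and 3. In the web UI, I visited the IP of each of mesos-primary-1, mesos-primary-2, and mesos-primary-3 on port 5050, and none of them redirected me to another one.

The lack of redirection leads me to believe that each machine thinks it has its own quorum or something similar, which is why they do not see each other and elect a leader.

Accessing port 8080 on any of the machines returns an error because no leader has been elected, although the address does resolve.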

$ cat /etc/mesos-master/quorum

outputs 2 on each master.

I have also stopped and restarted everything. On the masters:

$ sudo service mesos-master stop
$ sudo service marathon stop
$ sudo service zookeeper stop
$ sudo service mesos-master start
$ sudo service marathon start
$ sudo service zookeeper start

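On each slave machine:

$ sudo service mesos-slave stop
$ sudo service mesos-slave start

Still none of the slaves are discovered, and no leader is elected.

My logs are clean on all 3 IPs (since there is no redirect, I get a separate log for each IP). You can see each of them here: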

mesos-primary-1

Log file created at: 2015/10/02 11:00:01
Running on machine: mesos-primary-2
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1002 11:00:01.532337 13722 logging.cpp:172] INFO level logging started!
I1002 11:00:01.532865 13722 main.cpp:229] Build: 2015-09-25 19:13:24 by root
I1002 11:00:01.532894 13722 main.cpp:231] Version: 0.24.1
I1002 11:00:01.532903 13722 main.cpp:234] Git tag: 0.24.1
I1002 11:00:01.532909 13722 main.cpp:238] Git SHA: 44873806c2bb55da37e9adbece938274d8cd7c48
I1002 11:00:01.533020 13722 main.cpp:252] Using 'HierarchicalDRF' allocator
I1002 11:00:01.546877 13722 leveldb.cpp:176] Opened db in 13.691496ms
I1002 11:00:01.550370 13722 leveldb.cpp:183] Compacted db in 2.522303ms
I1002 11:00:01.550559 13722 leveldb.cpp:198] Created db iterator in 118591ns
I1002 11:00:01.550618 13722 leveldb.cpp:204] Seeked to beginning of db in 1151ns
I1002 11:00:01.550642 13722 leveldb.cpp:273] Iterated through 0 keys in the db in 767ns
I1002 11:00:01.551029 13722 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I1002 11:00:01.553994 13743 log.cpp:238] Attempting to join replica to ZooKeeper group
I1002 11:00:01.556193 13740 recover.cpp:449] Starting replica recovery
I1002 11:00:01.561755 13722 main.cpp:465] Starting Mesos master
I1002 11:00:01.563489 13740 recover.cpp:475] Replica is in EMPTY status
I1002 11:00:01.568989 13722 master.cpp:378] Master 20151002-110001-2874854303-5050-13722 (159.203.90.171) started on 159.203.90.171:5050
I1002 11:00:01.569059 13722 master.cpp:380] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname="159.203.90.171" --initialize_driver_logging="true" --ip="159.203.90.171" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="2" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181/mesos" --zk_session_timeout="10secs"
I1002 11:00:01.569535 13722 master.cpp:427] Master allowing unauthenticated frameworks to register
I1002 11:00:01.569581 13722 master.cpp:432] Master allowing unauthenticated slaves to register
I1002 11:00:01.569608 13722 master.cpp:469] Using default 'crammd5' authenticator
W1002 11:00:01.569718 13722 authenticator.cpp:505] No credentials provided, authentication requests will be refused.
I1002 11:00:01.570199 13722 authenticator.cpp:512] Initializing server SASL
I1002 11:00:01.582969 13722 master.cpp:1464] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I1002 11:00:01.584786 13743 contender.cpp:149] Joining the ZK group
I1002 11:00:11.573873 13747 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
I1002 11:01:06.547200 13743 http.cpp:321] HTTP GET for /master/state.json from 173.243.85.102:51963 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'

mesos-primary-2

Log file created at: 2015/10/02 11:00:01
Running on machine: mesos-primary-2
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1002 11:00:01.532337 13722 logging.cpp:172] INFO level logging started!
I1002 11:00:01.532865 13722 main.cpp:229] Build: 2015-09-25 19:13:24 by root
I1002 11:00:01.532894 13722 main.cpp:231] Version: 0.24.1
I1002 11:00:01.532903 13722 main.cpp:234] Git tag: 0.24.1
I1002 11:00:01.532909 13722 main.cpp:238] Git SHA: 44873806c2bb55da37e9adbece938274d8cd7c48
I1002 11:00:01.533020 13722 main.cpp:252] Using 'HierarchicalDRF' allocator
I1002 11:00:01.546877 13722 leveldb.cpp:176] Opened db in 13.691496ms
I1002 11:00:01.550370 13722 leveldb.cpp:183] Compacted db in 2.522303ms
I1002 11:00:01.550559 13722 leveldb.cpp:198] Created db iterator in 118591ns
I1002 11:00:01.550618 13722 leveldb.cpp:204] Seeked to beginning of db in 1151ns
I1002 11:00:01.550642 13722 leveldb.cpp:273] Iterated through 0 keys in the db in 767ns
I1002 11:00:01.551029 13722 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I1002 11:00:01.553994 13743 log.cpp:238] Attempting to join replica to ZooKeeper group
I1002 11:00:01.556193 13740 recover.cpp:449] Starting replica recovery
I1002 11:00:01.561755 13722 main.cpp:465] Starting Mesos master
I1002 11:00:01.563489 13740 recover.cpp:475] Replica is in EMPTY status
I1002 11:00:01.568989 13722 master.cpp:378] Master 20151002-110001-2874854303-5050-13722 (159.203.90.171) started on 159.203.90.171:5050
I1002 11:00:01.569059 13722 master.cpp:380] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname="159.203.90.171" --initialize_driver_logging="true" --ip="159.203.90.171" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="2" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181/mesos" --zk_session_timeout="10secs"
I1002 11:00:01.569535 13722 master.cpp:427] Master allowing unauthenticated frameworks to register
I1002 11:00:01.569581 13722 master.cpp:432] Master allowing unauthenticated slaves to register
I1002 11:00:01.569608 13722 master.cpp:469] Using default 'crammd5' authenticator
W1002 11:00:01.569718 13722 authenticator.cpp:505] No credentials provided, authentication requests will be refused.
I1002 11:00:01.570199 13722 authenticator.cpp:512] Initializing server SASL
I1002 11:00:01.582969 13722 master.cpp:1464] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I1002 11:00:01.584786 13743 contender.cpp:149] Joining the ZK group
I1002 11:00:11.573873 13747 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying

mesos-primary-3

Log file created at: 2015/10/02 11:00:12
Running on machine: mesos-primary-3
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1002 11:00:12.609675 17105 logging.cpp:172] INFO level logging started!
I1002 11:00:12.610414 17105 main.cpp:229] Build: 2015-09-25 19:13:24 by root
I1002 11:00:12.610452 17105 main.cpp:231] Version: 0.24.1
I1002 11:00:12.610468 17105 main.cpp:234] Git tag: 0.24.1
I1002 11:00:12.610483 17105 main.cpp:238] Git SHA: 44873806c2bb55da37e9adbece938274d8cd7c48
I1002 11:00:12.610576 17105 main.cpp:252] Using 'HierarchicalDRF' allocator
I1002 11:00:12.618232 17105 leveldb.cpp:176] Opened db in 7.382537ms
I1002 11:00:12.619810 17105 leveldb.cpp:183] Compacted db in 1.512691ms
I1002 11:00:12.619876 17105 leveldb.cpp:198] Created db iterator in 27030ns
I1002 11:00:12.619910 17105 leveldb.cpp:204] Seeked to beginning of db in 1254ns
I1002 11:00:12.619925 17105 leveldb.cpp:273] Iterated through 0 keys in the db in 339ns
I1002 11:00:12.620028 17105 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I1002 11:00:12.620930 17125 log.cpp:238] Attempting to join replica to ZooKeeper group
I1002 11:00:12.621615 17128 recover.cpp:449] Starting replica recovery
I1002 11:00:12.626735 17105 main.cpp:465] Starting Mesos master
I1002 11:00:12.627024 17128 recover.cpp:475] Replica is in EMPTY status
I1002 11:00:12.633635 17123 master.cpp:378] Master 20151002-110012-321094504-5050-17105 (104.131.35.19) started on 104.131.35.19:5050
I1002 11:00:12.633828 17123 master.cpp:380] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname="104.131.35.19" --initialize_driver_logging="true" --ip="104.131.35.19" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="2" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181/mesos" --zk_session_timeout="10secs"
I1002 11:00:12.635736 17123 master.cpp:427] Master allowing unauthenticated frameworks to register
I1002 11:00:12.635771 17123 master.cpp:432] Master allowing unauthenticated slaves to register
I1002 11:00:12.635802 17123 master.cpp:469] Using default 'crammd5' authenticator
W1002 11:00:12.635835 17123 authenticator.cpp:505] No credentials provided, authentication requests will be refused.
I1002 11:00:12.636078 17123 authenticator.cpp:512] Initializing server SASL
I1002 11:00:12.643378 17125 contender.cpp:149] Joining the ZK group
I1002 11:00:12.643826 17123 master.cpp:1464] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I1002 11:00:22.633390 17130 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying

I set up the machines by following this DigitalOcean guide.

Running

MASTER=$(mesos-resolve `cat /etc/mesos/zk`) mesos-execute --master=$MASTER --name="cluster-test" --command="sleep 5"

Yields

2015-10-02 12:30:26,137:14558(0x7f8dbb743700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@716: Client environment:host.name=mesos-primary-1
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@724: Client environment:os.arch=3.13.0-57-generic
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@725: Client environment:os.version=#95-Ubuntu SMP Fri Jun 19 09:28:15 UTC 2015
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@733: Client environment:user.name=root
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@753: Client environment:user.dir=/root
2015-10-02 12:30:26,142:14558(0x7f8dbb743700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181 sessionTimeout=10000 watcher=0x7f8dc3625610 sessionId=0 sessionPasswd=<null> context=0x7f8da8003960 flags=0
2015-10-02 12:30:26,142:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [104.131.35.19:2181]
2015-10-02 12:30:26,144:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [104.131.35.19:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:26,144:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [104.131.117.124:2181]
2015-10-02 12:30:26,144:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [104.131.117.124:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:26,145:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [159.203.90.171:2181]
2015-10-02 12:30:26,147:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [159.203.90.171:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:29,484:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [104.131.35.19:2181]
2015-10-02 12:30:29,485:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [104.131.35.19:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:29,485:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [104.131.117.124:2181]
2015-10-02 12:30:29,486:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [104.131.117.124:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:29,487:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [159.203.90.171:2181]
2015-10-02 12:30:29,488:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [159.203.90.171:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Failed to detect master from 'zk://159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181/mesos' within 5secs
root@mesos-primary-1:~# mesos-execute --master=$MASTER --name="cluster-test" --command="sleep 5"
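
The `errno=112(Host is down)` lines above are raised while talking to ZooKeeper on port 2181, not to Mesos itself. One way to check whether each ZooKeeper server is answering is its `ruok` four-letter command, which a healthy server answers with `imok`. A rough sketch, assuming the three server addresses from the `--zk` flag in the logs and that four-letter commands are enabled on the servers; `zk_ruok` and `check_all` are hypothetical helper names:

```python
import socket

# Addresses taken from the --zk flag in the master logs above.
ZK_SERVERS = [
    ("159.203.90.171", 2181),
    ("104.131.35.19", 2181),
    ("104.131.117.124", 2181),
]


def zk_ruok(host: str, port: int = 2181, timeout: float = 3.0) -> bool:
    """Return True if the server answers 'imok' to ZooKeeper's 'ruok' command."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            reply = sock.recv(16)
    except OSError:
        # Refused, unreachable, or timed out: treat all of these as "not ok".
        return False
    return reply == b"imok"


def check_all(servers):
    """Map each (host, port) pair to the result of the 'ruok' probe."""
    return {(host, port): zk_ruok(host, port) for host, port in servers}
```

Calling `check_all(ZK_SERVERS)` from each master would show which ZooKeeper servers, if any, respond from that machine.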

Does anyone have any ideas?

1 answer:

Answer 0 (score: 1)

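It looks to me as though your machines cannot reach each other, or the required ports are blocked on some or all of them. Make sure that:

A. Ports 2181 (ZooKeeper), 2888 and 3888 (ZooKeeper follower connections and leader election, respectively) are unblocked, and that 5050 (Mesos) and 8080 (Marathon, if you are using it) are unblocked so that the UIs are reachable from your desktop/laptop. I believe the slaves only need 2888 to be reachable on the masters.

B. You can ping all the other masters from one machine first, i.e. from master 1, ping masters 2 and 3.

C. You debug the masters forming a cluster correctly first, before worrying about the slaves.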
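
The reachability checks in A and B can be sketched in Python. This is only an illustration under assumptions: the master IPs are the ones from the `--zk` flag in the question, the port list is the one above, and `port_open` / `check_ports` are hypothetical helper names. A TCP connect is not the same thing as ping, and as far as I know port 2888 is only listened on by the current ZooKeeper leader, so a closed 2888 on the other servers is not by itself a problem.

```python
import socket

# Master IPs from the --zk flag in the question (assumed to be the three masters).
MASTERS = ["159.203.90.171", "104.131.35.19", "104.131.117.124"]

# Ports listed in point A.
PORTS = {
    2181: "ZooKeeper client",
    2888: "ZooKeeper follower connections",
    3888: "ZooKeeper leader election",
    5050: "Mesos master",
    8080: "Marathon",
}


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def check_ports(hosts, ports, timeout: float = 3.0):
    """Map each (host, port) pair to whether it accepted a TCP connection."""
    return {(h, p): port_open(h, p, timeout) for h in hosts for p in ports}
```

Running `check_ports(MASTERS, PORTS)` on each master in turn gives a table of which ports are reachable from that machine.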

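You seem to have a good set of configs and the right quorum setting here. Once you have made sure the machines can reach each other, you can investigate other potential issues. Let us know how it goes!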