如何解决在etd领导选举中的竞争状况?

时间:2014-11-24 12:17:39

标签: race-condition coreos etcd

在测试具有三个节点的Core Os群集时,在成功添加和删除少量其他节点后,我遇到了以下问题,原因是由于etcd选举过程中的竞争条件。

检查新领导者:

$ curl -L http://127.0.0.1:4001/v2/stats/leader
{"errorCode":300,"message":"Raft Internal Error","index":629006}

群集中每台计算机的Journalctl给出:

$ journalctl -r -u etcd
-- Logs begin at Wed 2014-11-12 15:09:01 UTC, end at Mon 2014-11-24 10:47:34 UTC. --
Nov 24 10:47:34 node-1 etcd[56576]: [etcd] Nov 24 10:47:34.307 INFO      | 965d12d38a4a4b2c807bd232fb7b0db7: term #5221 started.
Nov 24 10:47:34 node-1 etcd[56576]: [etcd] Nov 24 10:47:34.306 INFO      | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'candidate' to 'follower'.
Nov 24 10:47:33 node-1 etcd[56576]: [etcd] Nov 24 10:47:33.098 INFO      | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'follower' to 'candidate'.
Nov 24 10:47:32 node-1 etcd[56576]: [etcd] Nov 24 10:47:32.081 INFO      | 965d12d38a4a4b2c807bd232fb7b0db7: term #5219 started.
Nov 24 10:47:32 node-1 etcd[56576]: [etcd] Nov 24 10:47:32.081 INFO      | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'candidate' to 'follower'.
Nov 24 10:47:31 node-1 etcd[56576]: [etcd] Nov 24 10:47:31.962 INFO      | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'follower' to 'candidate'.

列出有车队的机器失败:

$ fleetctl list-machines
2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/24 10:56:19 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100ms
2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/24 10:56:19 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 200ms
2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused

列出群集中的计算机:

$ curl -L http://127.0.0.1:7001/v2/admin/machines
[{"name":"","state":"follower","clientURL":"http://100.72.62.35:4001","peerURL":"http://100.72.62.35:7001"},
{"name":"555cca74216644fea48990673b3d539c","state":"follower","clientURL":"http://100.72.62.59:4001","peerURL":"http://100.72.62.59:7001"},
{"name":"965d12d38a4a4b2c807bd232fb7b0db7","state":"follower","clientURL":"http://100.72.20.153:4001","peerURL":"http://100.72.20.153:7001"},
{"name":"a1b566dedb194c259f7eb2ffde5595b1","state":"follower","clientURL":"http://100.72.62.2:4001","peerURL":"http://100.72.62.2:7001"},
{"name":"a45efba827754b5f93c38b751a0ae273","state":"follower","clientURL":"http://100.72.62.31:4001","peerURL":"http://100.72.62.31:7001"},
{"name":"d041738235a9483cb814d37ca7fa4b6d","state":"follower","clientURL":"http://100.72.20.18:4001","peerURL":"http://100.72.20.18:7001"}]

但目前只有三台机器在运行。我试图添加额外的机器来达到法定人数无济于事。 我正在运行以下版本:

$ etcdctl -v  
etcdctl version 0.4.6

,正如此处https://coreos.com/docs/distributed-configuration/etcd-api/#cluster-config所述,强制领导者的领导者模块已被删除。丑陋的部分是,由于没有法定人数,我无法从机器列表中删除当前未运行的那些例如:

$ curl -L -XDELETE http://127.0.0.1:7001/v2/admin/machines/2abbf47a9e644bc69652a986d796d7a6

没有效果。有没有办法保存群集?

1 个答案:

答案 0 :(得分:1)

据我了解,您可以保存群集,但这不值得。

群集不接受新计算机,因为它需要一个法定数量来添加新计算机,并且没有法定数量的现有计算机。移除机器和删除密钥也是如此。

如果您可以启动列为群集成员的足够计算机,并让它们成功地作为群集成员工作,那么您将拥有一个法定人数并保存群集。

从我所看到的,您有六台机器被列为集群成员。您需要至少有四个运行现有集群才能运行。