我有3个AWS ec2实例的CoreOS群集。使用CoreOS堆栈cloudformation设置集群。群集启动并运行后,我需要更新自动扩展策略以获取ec2实例配置文件。我复制了现有的自动缩放配置文件并更新了ec2s的IAM角色。然后我终止了机队中的EC2,让自动扩展启动新实例。新实例确实承担了新角色,但群集似乎丢失了群集机器信息:
ip-10-214-156-29 ~ # systemctl -l status etcd.service
● etcd.service - etcd
Loaded: loaded (/usr/lib64/systemd/system/etcd.service; disabled)
Drop-In: /run/systemd/system/etcd.service.d
└─10-oem.conf, 20-cloudinit.conf
Active: activating (auto-restart) (Result: exit-code) since Wed 2014-09-24 18:28:58 UTC; 9s ago
Process: 14124 ExecStart=/usr/bin/etcd (code=exited, status=1/FAILURE)
Main PID: 14124 (code=exited, status=1/FAILURE)
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: Unit etcd.service entered failed state.
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 INFO | d9a7cb8df4a049689de452b6858399e9 attempted to join via 10.252.78.43:7001 failed: fail checking join version: Client Internal Error (Get http://10.252.78.43:7001/version: dial tcp 10.252.78.43:7001: connection refused)
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 WARNING | d9a7cb8df4a049689de452b6858399e9 cannot connect to existing peers [10.214.135.35:7001 10.16.142.108:7001 10.248.7.66:7001 10.35.142.159:7001 10.252.78.43:7001]: fail joining the cluster via given peers after 3 retries
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 CRITICAL | fail joining the cluster via given peers after 3 retries
cloud-init使用了相同的标记。 https://discovery.etcd.io/<cluster令牌&gt;显示6台机器,3台死机,3台新机器。所以看起来好像3个新实例加入了集群。期刊-u etcd.service日志显示etcd在死实例上超时,并且新连接被拒绝连接。
journal -u etcd.service shows:
...
Sep 24 06:01:11 ip-10-35-142-159.us-west-2.compute.internal etcd[574]: [etcd] Sep 24 06:01:11.198 INFO | 5c4531d885df4d06ae2d369c94f4de11 attempted to join via 10.214.156.29:7001 failed: fail checking join version: Client Internal Error (Get http://10.214.156.29:7001/version: dial tcp 10.214.156.29:7001: connection refused)
etcdctl --debug ls
Cluster-Peers: http://127.0.0.1:4001 http://10.35.142.159:4001
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/? consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
也许这不是更新集群配置的正确过程,但如果集群确实需要自动扩展,无论出于何种原因(例如负载触发),机群仍然可以使用死实例和池中混合了新实例?
如何在不拆除和重建的情况下从这种情况中恢复?
雪山
答案 0 :(得分:1)
在此计划中,etcd不会保留法定数量的机器,也无法成功运行。进行自动缩放的最佳方案是设置两组计算机:
#cloud-config
coreos:
fleet:
metadata: "role=autoscale"
etcd_servers: "http://:4001,http://:4001,http://:4001,http://:4001,http://:4001,http://:4001"
units:
- name: fleet.service
command: start
验证员不会让我在我的答案中放入任何10.x
个IP地址(wtf!?),所以一定要替换它们。
答案 1 :(得分:1)
您必须至少有一台计算机始终使用发现令牌运行,一旦所有计算机都关闭,心跳将失败并且没有新用户可以加入,您将需要一个新令牌供群集加入。