CoreOS fleet does not work properly after autoscaling

Asked: 2014-09-24 18:56:45

Tags: amazon-ec2 autoscaling coreos

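I have a CoreOS cluster of 3 AWS EC2 instances. The cluster was set up using the CoreOS CloudFormation stack. After the cluster was up and running, I needed to update the autoscaling policy to pick up an EC2 instance profile. I copied the existing autoscaling configuration and updated the IAM role for the EC2 instances. Then I terminated the EC2 instances in the fleet and let autoscaling launch new ones. The new instances did assume the new role, but the cluster seems to have lost its cluster machine information: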

ip-10-214-156-29 ~ # systemctl -l status etcd.service
● etcd.service - etcd
   Loaded: loaded (/usr/lib64/systemd/system/etcd.service; disabled)
  Drop-In: /run/systemd/system/etcd.service.d
       └─10-oem.conf, 20-cloudinit.conf
   Active: activating (auto-restart) (Result: exit-code) since Wed 2014-09-24 18:28:58 UTC; 9s ago
  Process: 14124 ExecStart=/usr/bin/etcd (code=exited, status=1/FAILURE)
 Main PID: 14124 (code=exited, status=1/FAILURE)

Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: etcd.service: main process  exited, code=exited, status=1/FAILURE
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: Unit etcd.service entered failed state.
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 INFO      | d9a7cb8df4a049689de452b6858399e9 attempted to join via 10.252.78.43:7001 failed: fail checking join version: Client Internal Error (Get http://10.252.78.43:7001/version: dial tcp 10.252.78.43:7001: connection refused)
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 WARNING   | d9a7cb8df4a049689de452b6858399e9 cannot connect to existing peers [10.214.135.35:7001 10.16.142.108:7001 10.248.7.66:7001 10.35.142.159:7001 10.252.78.43:7001]: fail joining the cluster via given peers after 3 retries
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 CRITICAL  | fail joining the cluster via given peers after 3 retries

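cloud-init used the same token. https://discovery.etcd.io/<cluster token> shows 6 machines: 3 dead and 3 new. So it looks like the 3 new instances did join the cluster. The journalctl -u etcd.service logs show etcd timing out on the dead instances and getting connection refused on the new ones.

journalctl -u etcd.service shows: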
...

Sep 24 06:01:11 ip-10-35-142-159.us-west-2.compute.internal etcd[574]: [etcd] Sep 24 06:01:11.198 INFO      | 5c4531d885df4d06ae2d369c94f4de11 attempted to join via 10.214.156.29:7001 failed: fail checking join version: Client Internal Error (Get http://10.214.156.29:7001/version: dial tcp 10.214.156.29:7001: connection refused)

etcdctl --debug  ls
Cluster-Peers: http://127.0.0.1:4001 http://10.35.142.159:4001
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Error:  501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

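Maybe this is not the right procedure for updating the cluster configuration, but if the cluster does need to autoscale for whatever reason (e.g. triggered by load), will the fleet still be usable with a mix of dead and new instances in the pool?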

How can I recover from this situation without tearing down and rebuilding?

雪山

2 Answers:

Answer 0 (score: 1)

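In this scheme, etcd will not keep a quorum of machines and cannot run successfully. The best approach for autoscaling is to set up two groups of machines: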

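  1. A fixed number (1-9) of etcd machines that are always up. These are set up with a discovery token or static networking, just as usual.
  2. Your autoscaling group, which does not start etcd but instead configures fleet (and any other tools) to use the fixed etcd cluster. You can do this in cloud-config. Here is an example that also sets some fleet metadata, so you can schedule jobs specifically onto the autoscaled machines if you want:

    #cloud-config
    coreos:
      fleet:
        metadata: "role=autoscale"
        etcd_servers: "http://:4001,http://:4001,http://:4001,http://:4001,http://:4001,http://:4001"
      units:
        - name: fleet.service
          command: start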

    The validator wouldn't let me put any 10.x IP addresses in my answer (wtf!?), so be sure to fill them in.
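    Since the addresses are missing from the snippet above, here is a minimal sketch of how the etcd_servers value could be assembled from the addresses of your fixed etcd machines. The helper name and the 192.0.2.x addresses are hypothetical placeholders, not anything from the original setup; port 4001 is the client port that appears in the question's output.

```python
def etcd_servers_value(hosts, port=4001):
    """Join etcd host addresses into one comma-separated URL string."""
    if not hosts:
        raise ValueError("at least one etcd host is required")
    return ",".join(f"http://{host}:{port}" for host in hosts)


# Hypothetical placeholder addresses; substitute the private IPs of
# the fixed etcd machines.
FIXED_ETCD_HOSTS = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]

if __name__ == "__main__":
    # Paste the printed string into the etcd_servers field of the cloud-config.
    print(etcd_servers_value(FIXED_ETCD_HOSTS))
```

    The resulting string goes into the etcd_servers field unchanged; the autoscaled machines then only act as clients of the fixed cluster.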

Answer 1 (score: 1)

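You must have at least one machine running with the discovery token at all times. Once all machines are down, the heartbeat fails and no new member can join; you will then need a new token for the cluster to join.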
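
One way to see why terminating every original instance at once is fatal: a consensus cluster like etcd can only make progress while a majority of its members is reachable. The sketch below is just that majority arithmetic, with hypothetical function names, not a model of etcd's actual join logic.

```python
def quorum_size(members):
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def has_quorum(members, reachable):
    """True if enough members are reachable for the cluster to make progress."""
    return reachable >= quorum_size(members)


if __name__ == "__main__":
    # The 3-member cluster from the question, with all 3 instances terminated.
    print(has_quorum(3, 0))
    # Replacing one instance at a time would leave 2 of 3 members reachable.
    print(has_quorum(3, 2))
```

With all three original members gone there is no majority left for the new instances to join, which matches the "cannot connect to existing peers" failure in the question.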