My AKS cluster got knocked over; how do I recover it?

Time: 2017-11-30 04:02:21

Tags: azure-container-service

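I have been load testing my application on a single-agent cluster in AKS. During the test, the connection to the dashboard stalled and never recovered. My application also appears to be down, so I assume the cluster is in a bad state.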

The API server is restate-f4cbd3d9.hcp.centralus.azmk8s.io

kubectl cluster-info dump shows the following error:

{
    "name": "kube-dns-v20-6c8f7f988b-9wpx9.14fbbbd6bf60f0cf",
    "namespace": "kube-system",
    "selfLink": "/api/v1/namespaces/kube-system/events/kube-dns-v20-6c8f7f988b-9wpx9.14fbbbd6bf60f0cf",
    "uid": "47f57d3c-d577-11e7-88d4-0a58ac1f0249",
    "resourceVersion": "185572",
    "creationTimestamp": "2017-11-30T02:36:34Z",
    "InvolvedObject": {
        "Kind": "Pod",
        "Namespace": "kube-system",
        "Name": "kube-dns-v20-6c8f7f988b-9wpx9",
        "UID": "9d2b20f2-d3f5-11e7-88d4-0a58ac1f0249",
        "APIVersion": "v1",
        "ResourceVersion": "299",
        "FieldPath": "spec.containers{kubedns}"
    },
    "Reason": "Unhealthy",
    "Message": "Liveness probe failed: Get http://10.244.0.4:8080/healthz-kubedns: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)",
    "Source": {
        "Component": "kubelet",
        "Host": "aks-agentpool-34912234-0"
    },
    "FirstTimestamp": "2017-11-30T02:23:50Z",
    "LastTimestamp": "2017-11-30T02:59:00Z",
    "Count": 6,
    "Type": "Warning"
}

There are also some pod sync errors in kube-system.

Example of the problem:

az aks browse -g REstate.Server -n REstate

Merged "REstate" as current context in C:\Users\User\AppData\Local\Temp\tmp29d0conq

Proxy running on http://127.0.0.1:8001/
Press CTRL+C to close the tunnel...
error: error upgrading connection: error dialing backend: dial tcp 10.240.0.4:10250: getsockopt: connection timed out

1 Answer:

Answer 0 (score: 2)

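You will probably need to SSH into the node to check whether the kubelet service is running. For the future, you can set resource quotas to keep workloads from exhausting all the resources on the cluster nodes.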

Resource quotas - https://kubernetes.io/docs/concepts/policy/resource-quotas/
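
As a rough illustration of the resource quota suggestion, a ResourceQuota manifest might look like the sketch below. The quota name, the namespace, and every limit value are hypothetical placeholders, not taken from the question; they would need to be sized against what the single agent node can actually provide.

```yaml
# Minimal sketch of a ResourceQuota. The name, namespace, and all
# numbers below are assumptions chosen only for illustration.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: load-test-quota      # hypothetical name
  namespace: default         # assumed namespace where the app runs
spec:
  hard:
    requests.cpu: "1"        # total CPU all pods in the namespace may request
    requests.memory: 1Gi     # total memory all pods may request
    limits.cpu: "2"          # sum of CPU limits across all pods
    limits.memory: 2Gi       # sum of memory limits across all pods
    pods: "10"               # maximum number of pods in the namespace
```

Saved as, say, quota.yaml, this could be applied with `kubectl apply -f quota.yaml` and inspected with `kubectl describe quota -n default`. Note that once a quota covers CPU or memory, pods in that namespace must declare the corresponding requests and limits (or inherit defaults from a LimitRange), otherwise their creation is rejected.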