Question

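I run a small GKE cluster with several node pools (2-8 nodes each, some of them preemptible). I've started to see a lot of health problems with the nodes themselves, and pod operations are taking a very long time (more than 30 minutes). This includes terminating pods, starting pods, starting initContainers in pods, starting the main containers in pods, and so on. Examples are below. The cluster runs a few NodeJS, PHP and Nginx containers, as well as an Elastic, a Redis and an NFS container. There are also a few PHP-based CronJobs. Together they make up a website that sits behind a CDN.

My question is: how do I go about debugging this on GKE, and what could be causing it?

I tried to SSH into the VM instances backing the nodes to check the logs, but my SSH connection always times out; I'm not sure whether that is normal.

Symptom: nodes flapping between Ready and NotReady: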

$ kubectl get nodes
NAME                                    STATUS     ROLES    AGE     VERSION
gke-cluster-default-pool-4fa127c-l3xt   Ready      <none>   62d     v1.13.6-gke.13
gke-cluster-default-pool-791e6c2-7b01   NotReady   <none>   45d     v1.13.6-gke.13
gke-cluster-preemptible-0f81875-cc5q    Ready      <none>   3h40m   v1.13.6-gke.13
gke-cluster-preemptible-0f81875-krqk    NotReady   <none>   22h     v1.13.6-gke.13
gke-cluster-preemptible-0f81875-mb05    Ready      <none>   5h42m   v1.13.6-gke.13
gke-cluster-preemptible-2453785-1c4v    Ready      <none>   22h     v1.13.6-gke.13
gke-cluster-preemptible-2453785-nv9q    Ready      <none>   134m    v1.13.6-gke.13
gke-cluster-preemptible-2453785-s7r2    NotReady   <none>   22h     v1.13.6-gke.13
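
On a larger cluster the NotReady nodes can be picked out of a listing like this programmatically. A minimal Python sketch, assuming the plain-text output of kubectl get nodes has been captured into a string; the function name not_ready_nodes is made up for illustration:

```python
def not_ready_nodes(kubectl_output):
    """Return the names of nodes whose STATUS does not start with 'Ready'."""
    names = []
    lines = kubectl_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        # STATUS can be a comma-separated list such as "Ready,SchedulingDisabled",
        # so only the first entry is compared.
        if fields[1].split(",")[0] != "Ready":
            names.append(fields[0])
    return names


sample = """\
NAME                                    STATUS     ROLES    AGE     VERSION
gke-cluster-default-pool-4fa127c-l3xt   Ready      <none>   62d     v1.13.6-gke.13
gke-cluster-default-pool-791e6c2-7b01   NotReady   <none>   45d     v1.13.6-gke.13
gke-cluster-preemptible-0f81875-krqk    NotReady   <none>   22h     v1.13.6-gke.13
"""
for name in not_ready_nodes(sample):
    print(name)
```

Running this on captures taken a few minutes apart would show whether it is always the same nodes that flap.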

Symptom: nodes are sometimes rebooted:

2019-08-09 14:23:54.000 CEST
Node gke-cluster-preemptible-0f81875-mb05 has been rebooted, boot id: e601f182-2eab-46b0-a953-7787f95d438

Symptom: the cluster is unhealthy:

2019-08-09T11:29:03Z Cluster is unhealthy 
2019-08-09T11:33:25Z Cluster is unhealthy 
2019-08-09T11:41:08Z Cluster is unhealthy 
2019-08-09T11:45:10Z Cluster is unhealthy 
2019-08-09T11:49:11Z Cluster is unhealthy 
2019-08-09T11:53:23Z Cluster is unhealthy

Symptom: various PLEG health errors in the node logs (there are many, many entries of this type):

12:53:10.573176 1315163 kubelet.go:1854] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m26.30454685s ago; threshold is 3m0s] 
12:53:18.126428 1036 setters.go:520] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-08-09 12:53:18.126363615 +0000 UTC m=+3924434.187952856 LastTransitionTime:2019-08-09 12:53:18.126363615 +0000 UTC m=+3924434.187952856 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m5.837134315s ago; threshold is 3m0s}
12:53:38.627284 1036 kubelet.go:1854] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m26.338024015s ago; threshold is 3m0s]
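
To see how far past the threshold PLEG actually gets, the durations can be pulled out of these messages. A rough Python sketch, assuming the kubelet log lines are available as text and that the durations are Go-style strings like the ones above; the helper names go_duration_seconds and pleg_lags are made up for illustration:

```python
import re

# Seconds per unit for Go-style duration strings such as "3m26.30454685s".
UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0,
                "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}
# Two-letter units come first so "ms" is not read as "m" followed by "s".
DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|h|m|s)")
PLEG_LINE = re.compile(
    r"pleg was last seen active ([0-9a-zµ.]+) ago; threshold is ([0-9a-zµ.]+)")


def go_duration_seconds(text):
    """Convert a Go duration string like '3m0s' to seconds."""
    return sum(float(value) * UNIT_SECONDS[unit]
               for value, unit in DURATION_PART.findall(text))


def pleg_lags(lines):
    """Yield (lag_seconds, threshold_seconds) for each PLEG health message."""
    for line in lines:
        match = PLEG_LINE.search(line)
        if match:
            yield (go_duration_seconds(match.group(1)),
                   go_duration_seconds(match.group(2)))


sample = [
    "12:53:10.573176 1315163 kubelet.go:1854] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m26.30454685s ago; threshold is 3m0s]",
    "12:53:38.627284 1036 kubelet.go:1854] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m26.338024015s ago; threshold is 3m0s]",
]
for lag, threshold in pleg_lags(sample):
    print(f"lag {lag:.1f}s, {lag - threshold:.1f}s over the threshold")
```

A lag that hovers just above the threshold and one that keeps growing would suggest different things, so it may be worth plotting these values over time per node.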

Symptom: pods report "network is not ready" errors:

2019-08-09T12:42:45Z network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized] 
2019-08-09T12:42:47Z network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized] 
2019-08-09T12:42:49Z network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]

Symptom: pods complain about "context deadline exceeded":

2019-08-09T08:04:07Z error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded 
2019-08-09T08:04:15Z error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded 
2019-08-09T08:04:20Z error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded 
2019-08-09T08:04:26Z error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded

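Something particularly odd is obviously going on, yet IOPS, ingress requests and CPU/memory saturation are all fairly low. I was hoping some of the symptoms would point me in a direction to debug further, but the errors seem to be all over the place.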
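
When errors are scattered like this, tallying them per symptom can at least show which one dominates. A small Python sketch, assuming the event and log lines have been exported as text; the labels and the tally_symptoms name are made up for illustration, and the substrings are simply taken from the messages quoted above:

```python
from collections import Counter

# Label -> substring taken from the messages quoted in this post.
SIGNATURES = {
    "pleg_unhealthy": "PLEG is not healthy",
    "cni_uninitialized": "cni config uninitialized",
    "deadline_exceeded": "context deadline exceeded",
    "cluster_unhealthy": "Cluster is unhealthy",
    "node_rebooted": "has been rebooted",
}


def tally_symptoms(lines):
    """Count lines per symptom; each line is counted under its first match."""
    counts = Counter()
    for line in lines:
        for label, needle in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
                break
    return counts


sample = [
    "2019-08-09T11:29:03Z Cluster is unhealthy",
    "2019-08-09T12:42:45Z network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]",
    "2019-08-09T08:04:07Z error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded",
    "2019-08-09T08:04:15Z error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded",
]
for label, count in tally_symptoms(sample).most_common():
    print(label, count)
```

Grouping the same counts by node name or by hour would be a natural next step, since that could show whether the symptoms cluster on particular nodes or time windows.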

Answer 1

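Given that GKE is a managed solution and that many systems are involved in its operation, I think your best option is to reach out to the GCP support team.

They have specific tools to locate issues on the nodes (if any) and can dig deeper into the logs to determine the root cause of the problem.

The logs you have shown so far may point to this old issue, apparently related to Docker, and also to a problem with the CNI not being ready, which prevents the nodes from reporting to the master; the master then considers them not ready.

Please treat this as mere speculation, since the support team will be able to dig deeper and give more accurate advice.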
