All Kubernetes workers reporting "Error updating node status, will retry: error getting node"

Date: 2019-04-12 02:50:41

Tags: amazon-web-services kubernetes kubernetes-ingress

Pods on all workers unexpectedly became unavailable for 15 minutes before the cluster corrected itself.

"kube-scheduler" began reporting the following message for several pods and cron jobs:

 <namespace-xx>/prod-xxx-process-invoice-pdf-1554813000-tzbbg already in flight, abandoning

Pods were rescheduled, started automatically, and returned to normal; DaemonSets kept running and were unaffected.

The cluster has the following setup:

  • Kubernetes 1.9.3
  • AWS, provisioned with kube-aws
  • Configured for multi-AZ
  • Cluster masters clustered (across AZs)
  • Workers running CoreOS

Goal: determine what caused this outage and why.

The worker logs reported:

About four minutes before the event, we see messages about the "regsecret" pull secret:

Apr 09 12:26:18 ip-x-x-x-x.ec2.internal kubelet-wrapper[1648]: W0409 12:26:18.366087    1648 kubelet_pods.go:855] Unable to retrieve pull secret bcaas/regsecret for alpha/prod-quicksilver-mock-b2bc-5b9d446b96-6lwmv due to secrets "regsecret" not found.  The image pull may not succeed.
Apr 09 12:26:22 ip-x-x-x-x.ec2.internal kubelet-wrapper[1648]: W0409 12:26:22.359932    1648 kubelet_pods.go:855] Unable to retrieve pull secret tollstt/regsecret for beta/tolls-h2h-tolls-h2h-59bd9fdbd7-qwwkm due to secrets "regsecret" not found.  The image pull may not succeed.
Apr 09 12:26:25 ip-x-x-x-x.ec2.internal kubelet-wrapper[1648]: W0409 12:26:25.360631    1648 kubelet_pods.go:855] Unable to retrieve pull secret bcaas/regsecret for alpha/prod-colossus-pdf-generator-5cc59944d-946l9 due to secrets "regsecret" not found.  The image pull may not succeed.
Apr 09 12:26:34 ip-x-x-x-x.ec2.internal kubelet-wrapper[1648]: W0409 12:26:34.360547    1648 kubelet_pods.go:855] Unable to retrieve pull secret default/regsecret for default/hello-world-hello-6dfd58c68b-z4n2d due to secrets "regsecret" not found.  The image pull may not succeed.
Apr 09 12:26:57 ip-x-x-x-x.ec2.internal kubelet-wrapper[1648]: W0409 12:26:57.360093    1648 kubelet_pods.go:855] Unable to retrieve pull secret onlineapp/regsecret for gamma/onlineapp-prod-6d58594794-5qqdh due to secrets "regsecret" not found.  The image pull may not succeed.
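
These warnings mean the kubelet could not find the imagePullSecrets referenced by those pods; on their own they only threaten image pulls, not already-running pods, and they may be unrelated to the outage. As a sanity check, the sketch below verifies whether "regsecret" exists in each namespace named in the warnings. It assumes client-go with the pre-context method signatures contemporary with Kubernetes 1.9, and the kubeconfig path is a placeholder:

package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; point this at a real admin kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	// Namespaces taken from the kubelet warnings above.
	for _, ns := range []string{"bcaas", "tollstt", "default", "onlineapp"} {
		if _, err := clientset.CoreV1().Secrets(ns).Get("regsecret", metav1.GetOptions{}); err != nil {
			fmt.Printf("%s/regsecret: %v\n", ns, err)
		} else {
			fmt.Printf("%s/regsecret: present\n", ns)
		}
	}
}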

During the event, we see messages of the following kind:

 kubelet_node_status.go:383] Error updating node status, will retry: error getting node "ip-x-x-x-x.ec2.internal": Get https://api.k8s-cluster.prod.regionx.comanyinfra.com:443/api/v1/nodes/ip-x-x-x-x.ec2.internal?resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Apr 09 12:31:29 ip-x-x-x-x.ec2.internal kubelet-wrapper[1648]: E0409 12:31:29.502554    1648 kubelet_node_status.go:375] Unable to update node status: update node status exceeds retry count
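
Both of these lines come from the kubelet's node-status heartbeat loop. The following is a self-contained mock, not the real source: the retry constant and the two messages are modeled on kubelet_node_status.go of that era, and the stubbed request always fails, the way it did while the apiserver was unreachable:

package main

import (
	"errors"
	"fmt"
)

// nodeStatusUpdateRetry mirrors the kubelet constant of the 1.9 era:
// the heartbeat gives up after five failed attempts.
const nodeStatusUpdateRetry = 5

// tryUpdateNodeStatus stands in for the kubelet's real GET-then-PUT of
// the Node object (the first GET is served from the apiserver cache,
// which is why resourceVersion=0 appears in the failing URL above).
// Here it always fails, as it did during the incident.
func tryUpdateNodeStatus(attempt int) error {
	return errors.New("error getting node: Client.Timeout exceeded while awaiting headers")
}

// updateNodeStatus reproduces the retry loop that emits both log lines
// seen during the incident.
func updateNodeStatus() error {
	for i := 0; i < nodeStatusUpdateRetry; i++ {
		if err := tryUpdateNodeStatus(i); err != nil {
			fmt.Printf("Error updating node status, will retry: %v\n", err)
			continue
		}
		return nil
	}
	return fmt.Errorf("update node status exceeds retry count")
}

func main() {
	if err := updateNodeStatus(); err != nil {
		fmt.Printf("Unable to update node status: %v\n", err)
	}
}

Once these heartbeats stop arriving, the controller manager marks the node NotReady after node-monitor-grace-period (40s by default) and begins evicting its pods after pod-eviction-timeout (5m by default), which is consistent with pods on every worker being disrupted for roughly 15 minutes.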

During the interval, we can see pod activity returning to normal:

Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]: I0409 12:43:24.427655   772 prefs.cc:51] certificate-report-to-send-update not present in /var/lib/update_engine/prefs
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]: I0409 12:43:24.427716   772 prefs.cc:51] certificate-report-to-send-download not present in /var/lib/update_engine/prefs
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]: I0409 12:43:24.427953   772 omaha_request_params.cc:59] Current group set to stable
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]: I0409 12:43:24.428097   772 update_attempter.cc:483] Already updated boot flags. Skipping.
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]: I0409 12:43:24.428105   772 update_attempter.cc:626] Scheduling an action processor start.
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]: I0409 12:43:24.428135   772 action_processor.cc:36] ActionProcessor::StartProcessing: OmahaRequestAction
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]: I0409 12:43:24.428275   772 omaha_request_action.cc:245] Posting an Omaha request to https://public.update.core-os.net/v1/update/
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]: I0409 12:43:24.428287   772 omaha_request_action.cc:246] Request: <?xml version="1.0" encoding="UTF-8"?>
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]: <request protocol="3.0" version="update_engine-0.4.9" updaterversion="update_engine-0.4.9" installsource="scheduler" ismachine="1">
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]:     <os version="Chateau" platform="CoreOS" sp="2023.5.0_x86_64"></os>
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]:     <app appid="{e96281a6-d1af-4bde-9a0a-97b76e56dc57}" version="2023.5.0" track="stable" bootid="{6afd1274-afbb-45aa-bb0f-e7be3fae6af7}" oem="ami" oemversion="0.1.1-r1" alephversion="2023.5.0" machineid="5e623d7043054394a9cd731f5bc4c50d" lang="en-US" board="amd64-usr" hardware_class="" delta_okay="false" >
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]:         <ping active="1"></ping>
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]:         <updatecheck></updatecheck>
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]:         <event eventtype="3" eventresult="2" previousversion=""></event>
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]:     </app>
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]: </request>
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]: I0409 12:43:24.428293   772 libcurl_http_fetcher.cc:48] Starting/Resuming transfer
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]: I0409 12:43:24.428349   772 libcurl_http_fetcher.cc:164] Setting up curl options for HTTPS
Apr 09 12:43:24 ip-x-x-x-x.ec2.internal update_engine[772]: I0409 12:43:24.428449   772 libcurl_http_fetcher.cc:427] Setting up timeout source: 1 seconds.
Apr 09 12:43:25 ip-x-x-x-x.ec2.internal update_engine[772]: I0409 12:43:25.612932   772 libcurl_http_fetcher.cc:240] HTTP response code: 200
Apr 09 12:43:25 ip-x-x-x-x.ec2.internal update_engine[772]: I0409 12:43:25.614888   772 libcurl_http_fetcher.cc:297] Transfer completed (200), 287 bytes downloaded
Apr 09 12:43:25 ip-x-x-x-x.ec2.internal update_engine[772]: I0409 12:43:25.614912   772 omaha_request_action.cc:592] Omaha request response: <?xml version="1.0" encoding="UTF-8"?>

Heapster logs:

E0409 12:29:05.001242       1 summary.go:404] node ip-x-x-x-x.ec2.internal is not ready

kube-proxy logs:

Here we see all the workers timing out against the etcd servers:

E0409 12:32:17.402867       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.122.41.47:36370->10.122.41.79:443: read: connection timed out
E0409 12:32:17.575104       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.122.41.86:45248->10.122.41.79:443: read: connection timed out
E0409 12:32:17.375428       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.122.41.36:37648->10.122.41.79:443: read: connection timed out
E0409 12:32:35.909590       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.122.41.110:54768->10.122.41.18:443: read: connection timed out
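
The streamwatcher.go error is logged when a long-lived HTTP watch stream to the apiserver dies mid-read and the next decode fails; the informers inside kube-proxy then drop the watch and re-establish it, so these errors are noisy but self-healing once connectivity returns. A minimal sketch of that watch-and-reconnect pattern, again assuming era-appropriate client-go signatures and a placeholder kubeconfig path:

package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; adjust for your environment.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	for {
		// Open a watch on Endpoints, the same resource kube-proxy tracks.
		w, err := clientset.CoreV1().Endpoints(metav1.NamespaceAll).Watch(metav1.ListOptions{})
		if err != nil {
			fmt.Println("watch failed, will re-establish:", err)
			time.Sleep(time.Second)
			continue
		}
		// If the underlying TCP connection times out, decoding fails
		// (the streamwatcher.go error above), the result channel closes,
		// and we loop around to re-establish the watch.
		for ev := range w.ResultChan() {
			fmt.Println("event:", ev.Type)
		}
		fmt.Println("watch stream closed, re-establishing")
	}
}
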
So far we have:

  • Ruled out CPU and memory resource limits at the time
  • Confirmed the workers did not restart
  • Ruled out certificate renewal

0 answers
