Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod network

Date: 2018-07-04 08:59:57

Tags: kubernetes kubectl kubeadm flannel

I deployed a Redis pod on a k8s (v1.10) cluster, and pod creation is stuck at "ContainerCreating".

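6 answers:

Answer 0 (score: 3)

Make sure that /etc/cni/net.d and its friend /opt/cni/bin both exist and are correctly populated with the CNI configuration files and binaries on all nodes. For flannel specifically, one might make use of the flannel cni repo.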
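
The check described above can be sketched in Python. This is a minimal sketch, not an official tool: the function name check_cni_layout is made up for illustration, and the set of configuration extensions (.conf, .conflist, .json) is an assumption about what the CNI loader picks up. It would have to be run on each node.

```python
import os
from pathlib import Path


def check_cni_layout(conf_dir="/etc/cni/net.d", bin_dir="/opt/cni/bin"):
    """Return a list of human-readable problems found in the CNI layout."""
    problems = []
    conf = Path(conf_dir)
    bins = Path(bin_dir)

    # The configuration directory must exist and hold at least one config file.
    if not conf.is_dir():
        problems.append(f"{conf} does not exist")
    else:
        # Assumed extensions; adjust if your runtime loads others.
        configs = [p for p in conf.iterdir()
                   if p.suffix in {".conf", ".conflist", ".json"}]
        if not configs:
            problems.append(f"{conf} contains no CNI configuration files")

    # The binary directory must exist and hold at least one executable plugin.
    if not bins.is_dir():
        problems.append(f"{bins} does not exist")
    else:
        executables = [p for p in bins.iterdir()
                       if p.is_file() and os.access(p, os.X_OK)]
        if not executables:
            problems.append(f"{bins} contains no executable plugin binaries")

    return problems


if __name__ == "__main__":
    for problem in check_cni_layout():
        print("PROBLEM:", problem)
```

An empty result only says the directories are populated; it does not prove that the files in them are valid or match the installed network plugin.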

Answer 1 (score: 0)

I faced a similar issue when I used Calico as the CNI. The container remained in the creating state. I checked for /etc/cni/net.d and /opt/cni/bin on the master and both are present, but I am not sure whether they are required on the worker node as well.

root@KubernetesMaster:/opt/cni/bin# kubectl get pods
NAME                   READY   STATUS              RESTARTS   AGE
nginx-5c7588df-5zds6   0/1     ContainerCreating   0          21m
root@KubernetesMaster:/opt/cni/bin# kubectl get nodes
NAME               STATUS   ROLES    AGE   VERSION
kubernetesmaster   Ready    master   26m   v1.13.4
kubernetesslave1   Ready    <none>   22m   v1.13.4
root@KubernetesMaster:/opt/cni/bin#

kubectl describe pods
Name:               nginx-5c7588df-5zds6
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               kubernetesslave1/10.0.3.80
Start Time:         Sun, 17 Mar 2019 05:13:30 +0000
Labels:             app=nginx
                    pod-template-hash=5c7588df
Annotations:        <none>
Status:             Pending
IP:
Controlled By:      ReplicaSet/nginx-5c7588df
Containers:
  nginx:
    Container ID:
    Image:          nginx
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qtfbs (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-qtfbs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-qtfbs
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                    From                       Message
  ----     ------                  ----                   ----                       -------
  Normal   Scheduled               18m                    default-scheduler          Successfully assigned default/nginx-5c7588df-5zds6 to kubernetesslave1
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "123d527490944d80f44b1976b82dbae5dc56934aabf59cf89f151736d7ea8adc" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "8cc5e62ebaab7075782c2248e00d795191c45906cc9579464a00c09a2bc88b71" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "30ffdeace558b0935d1ed3c2e59480e2dd98e983b747dacae707d1baa222353f" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "630e85451b6ce2452839c4cfd1ecb9acce4120515702edf29421c123cf231213" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "820b919b7edcfc3081711bb78b79d33e5be3f7dafcbad29fe46b6d7aa22227aa" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "abbfb5d2756f12802072039dec20ba52f546ae755aaa642a9a75c86577be589f" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "dfeb46ffda4d0f8a434f3f3af04328fcc4b6c7cafaa62626e41b705b06d98cc4" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "9ae3f47bb0282a56e607779d3267127ee8b0ae1d7f416f5a184682119203b1c8" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "546d07f1864728b2e2675c066775f94d658e221ada5fb4ed6bf6689ec7b8de23" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Normal   SandboxChanged          18m (x12 over 18m)     kubelet, kubernetesslave1  Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  3m39s (x829 over 18m)  kubelet, kubernetesslave1  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "f586be437843537a3082f37ad139c88d0eacfbe99ddf00621efd4dc049a268cc" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
root@KubernetesMaster:/etc/cni/net.d#
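
The FailedCreatePodSandBox events above all end with the same underlying cause after the "network:" marker. As a rough sketch (the helper name extract_cni_cause and the regular expression are illustrative assumptions based only on the message format shown in this output), the cause can be pulled out of such a message like this:

```python
import re

# Matches the tail of a kubelet FailedCreatePodSandBox message of the form
#   ... NetworkPlugin cni failed to set up pod "<pod>_<namespace>" network: <cause>
CNI_FAILURE = re.compile(
    r'NetworkPlugin cni failed to set up pod "(?P<pod>[^"]+)" network: (?P<cause>.+)$',
    re.DOTALL,
)


def extract_cni_cause(message):
    """Return (pod, cause) from a sandbox failure message, or None if it does not match."""
    match = CNI_FAILURE.search(message)
    if match is None:
        return None
    return match.group("pod"), match.group("cause").strip()


if __name__ == "__main__":
    # Message copied from the events above.
    sample = (
        'Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up '
        'sandbox container "123d527490944d80f44b1976b82dbae5dc56934aabf59cf89f151736d7ea8adc" '
        'network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod '
        '"nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such '
        'file or directory: check that the calico/node container is running and has '
        'mounted /var/lib/calico/'
    )
    print(extract_cni_cause(sample))
```

Here the extracted cause points at the calico/node container rather than at the contents of /etc/cni/net.d or /opt/cni/bin.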

On the worker node, NGINX tries to come up but exits. I am not sure what is going on here. I am new to Kubernetes and unable to fix this issue.

root@kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   3 minutes ago       Up 3 minutes                            k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e        k8s.gcr.io/pause:3.1   "/pause"                 3 minutes ago       Up 3 minutes                            k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563        k8s.gcr.io/pause:3.1   "/pause"                 3 minutes ago       Up 3 minutes                            k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1

root@kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS                  PORTS               NAMES
94b2994401d0        k8s.gcr.io/pause:3.1   "/pause"                 1 second ago        Up Less than a second                       k8s_POD_nginx-5c7588df-5zds6_default_677a722b-4873-11e9-a33a-06516e7d78c4_534
5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   4 minutes ago       Up 4 minutes                                k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e        k8s.gcr.io/pause:3.1   "/pause"                 4 minutes ago       Up 4 minutes                                k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563        k8s.gcr.io/pause:3.1   "/pause"                 4 minutes ago       Up 4 minutes                                k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root@kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   4 minutes ago       Up 4 minutes                            k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e        k8s.gcr.io/pause:3.1   "/pause"                 4 minutes ago       Up 4 minutes                            k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563        k8s.gcr.io/pause:3.1   "/pause"                 4 minutes ago       Up 4 minutes                            k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root@kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS                  PORTS               NAMES
f72500cae2b7        k8s.gcr.io/pause:3.1   "/pause"                 1 second ago        Up Less than a second                       k8s_POD_nginx-5c7588df-5zds6_default_677a722b-4873-11e9-a33a-06516e7d78c4_585
5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   4 minutes ago       Up 4 minutes                                k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e        k8s.gcr.io/pause:3.1   "/pause"                 4 minutes ago       Up 4 minutes                                k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563        k8s.gcr.io/pause:3.1   "/pause"                 4 minutes ago       Up 4 minutes                                k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root@kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   5 minutes ago       Up 5 minutes                            k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e        k8s.gcr.io/pause:3.1   "/pause"                 5 minutes ago       Up 5 minutes                            k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563        k8s.gcr.io/pause:3.1   "/pause"                 5 minutes ago       Up 5 minutes                            k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1

I also checked /etc/cni/net.d and /opt/cni/bin on the worker node, and they are present:

root@kubernetesslave1:/home/ubuntu# cd /etc/cni
root@kubernetesslave1:/etc/cni# ls -ltr
total 4
drwxr-xr-x 2 root root 4096 Mar 17 05:19 net.d
root@kubernetesslave1:/etc/cni# cd /opt/cni
root@kubernetesslave1:/opt/cni# ls -ltr
total 4
drwxr-xr-x 2 root root 4096 Mar 17 05:19 bin
root@kubernetesslave1:/opt/cni# cd bin
root@kubernetesslave1:/opt/cni/bin# ls -ltr
total 107440
-rwxr-xr-x 1 root root  3890407 Aug 17  2017 bridge
-rwxr-xr-x 1 root root  3475802 Aug 17  2017 ipvlan
-rwxr-xr-x 1 root root  3520724 Aug 17  2017 macvlan
-rwxr-xr-x 1 root root  3877986 Aug 17  2017 ptp
-rwxr-xr-x 1 root root  3475750 Aug 17  2017 vlan
-rwxr-xr-x 1 root root  9921982 Aug 17  2017 dhcp
-rwxr-xr-x 1 root root  2605279 Aug 17  2017 sample
-rwxr-xr-x 1 root root 32351072 Mar 17 05:19 calico
-rwxr-xr-x 1 root root 31490656 Mar 17 05:19 calico-ipam
-rwxr-xr-x 1 root root  2856252 Mar 17 05:19 flannel
-rwxr-xr-x 1 root root  3084347 Mar 17 05:19 loopback
-rwxr-xr-x 1 root root  3036768 Mar 17 05:19 host-local
-rwxr-xr-x 1 root root  3550877 Mar 17 05:19 portmap
-rwxr-xr-x 1 root root  2850029 Mar 17 05:19 tuning
root@kubernetesslave1:/opt/cni/bin#

Answer 2 (score: 0)

AWS EKS does not yet support t3a, m5ad, and r5ad instances.

Answer 3 (score: 0)

I ran into this problem when adding a PVC on AWS EKS.

Updating the aws-node CNI plugin to the latest version resolved it:

https://docs.aws.amazon.com/eks/latest/userguide/cni-upgrades.html

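Answer 4 (score: 0)

The following steps reset the Kubernetes cluster and helped me resolve my issue.

  1. Stop all running pods
  2. Remove all worker nodes from the cluster
  3. Run kubeadm reset on the master and on the worker nodes
  4. Start the master node: kubeadm init --apiserver-advertise-address
  5. Install the pod network "WeaveNet": kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')&env.IPALLOC_RANGE=192.168.0.0/16"
  6. Join the nodes to the cluster
  7. Reboot all nodes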

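Answer 5 (score: 0)

I ran into this issue with a GKE cluster on GCP and one of my preemptible node pools. Thanks to @mdaniel's tip about checking the integrity of /etc/cni/net.d, I was able to reproduce the problem by SSHing into a node of a test cluster with the command gcloud compute ssh <name of some node> --zone <zone-of-cluster> --internal-ip. I then simply edited the file /etc/cni/net.d/10-gke-ptp.conflist and corrupted the value of "routes": [ {"dst": "0.0.0.0/0"} ] (changing it from 0.0.0.0/0 to 1.0.0.0/0).

After that, I deleted the pods running on that node, and they all got stuck in the ContainerCreating state, with the kubelet endlessly generating events with the error Failed create pod sandbox: rpc error: code...

Note that for this test I set the node pool to a maximum of 1 node. Otherwise it would scale out to a new node and the pods would be recreated on that new node. In my production incident, the node pool had reached its maximum number of nodes, so capping my test at 1 node reproduced a similar situation.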
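
The corruption used in the reproduction above (a default route changed from 0.0.0.0/0 to 1.0.0.0/0) could be detected with a small check like the one below. This is only a sketch under assumptions: the helper names find_routes and has_default_route are hypothetical, the sample JSON is an invented fragment rather than the real contents of 10-gke-ptp.conflist, and the code searches the whole document for "routes" keys so that it does not depend on where exactly the file nests them.

```python
import json


def find_routes(node):
    """Yield every "routes" list found anywhere in a parsed CNI config."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "routes" and isinstance(value, list):
                yield value
            else:
                yield from find_routes(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_routes(item)


def has_default_route(conflist_text, default_dst="0.0.0.0/0"):
    """Return True if any route in the config has the expected default destination."""
    config = json.loads(conflist_text)
    return any(
        isinstance(route, dict) and route.get("dst") == default_dst
        for routes in find_routes(config)
        for route in routes
    )


if __name__ == "__main__":
    # Hypothetical fragment, not the real GKE file.
    healthy = '{"plugins": [{"type": "ptp", "ipam": {"routes": [{"dst": "0.0.0.0/0"}]}}]}'
    broken = healthy.replace("0.0.0.0/0", "1.0.0.0/0")
    print("healthy has default route:", has_default_route(healthy))
    print("broken has default route:", has_default_route(broken))
```

On a node, the text would be read from /etc/cni/net.d/10-gke-ptp.conflist. A missing default route is just one possible form of damage, so a passing check does not prove the file is intact.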

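Since deleting the node from GKE resolved the issue in production, I created a Python script that lists all events on the cluster and filters those containing the keyword "Failed create pod sandbox: rpc error: code". I then go through the events to get their pods, and from the pods get the nodes. Finally, I loop over the nodes and delete them from both the Kubernetes API and the Compute API, using their respective Python clients. For the Python script I used the libraries kubernetes and google-cloud-compute.

Here is a simpler version of the script. Test it before using it: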

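import time

from kubernetes import client, config
from google.cloud.compute_v1.services.instances import InstancesClient

# Placeholders: set these to your own GCP project ID and the zone of the cluster nodes.
PROJECT_ID = "<your-project-id>"
CLUSTER_ZONE = "<zone-of-cluster>"

# Lowercased substrings that mark a sandbox failure. The events quoted in this
# thread read "Failed create pod sandbox"; the variant with "to" is kept from
# the original script.
ERROR_KEYWORDS = [
    'failed create pod sandbox',
    'failed to create pod sandbox',
]

config.load_kube_config()
v1 = client.CoreV1Api()
gcp_client = InstancesClient()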

events_result = v1.list_event_for_all_namespaces()

filtered_events = []

# filter only the events containing ERROR_KEYWORDS
for event in events_result.items:
    # event.message can be None, so fall back to an empty string
    message = (event.message or "").lower()
    if any(error_keyword in message for error_keyword in ERROR_KEYWORDS):
        filtered_events.append(event)


# gets the list of pods from those events
pods_list = {}

for event in filtered_events:
    try:
        pod = v1.read_namespaced_pod(
            event.involved_object.name,
            namespace=event.involved_object.namespace
        )

        pod_dict = {
            "name": event.involved_object.name,
            "namespace": event.involved_object.namespace,
            "node": pod.spec.node_name
        }

        pods_list[event.involved_object.name] = pod_dict

    except Exception as e:
        pass


# Get the nodes from those pods
broken_nodes = set()

for name, pod_dict in pods_list.items():
    if pod_dict.get('node'):
        broken_nodes.add(pod_dict["node"])


broken_nodes = list(broken_nodes)

# Deletes the nodes from both Kubernetes API and Compute Engine API
if broken_nodes:
    broken_nodes_str = ", ".join(broken_nodes)
    print(f'BROKEN NODES: "{broken_nodes_str}"')
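    for node in broken_nodes:
        # Remove the node object from the Kubernetes API first
        try:
            api_response = v1.delete_node(node)
        except Exception as e:
            print(f'Could not delete node "{node}" from Kubernetes: {e}')

        # Wait 30 seconds before deleting the VM
        time.sleep(30)

        # Then delete the underlying Compute Engine instance
        try:
            result = gcp_client.delete(project=PROJECT_ID, zone=CLUSTER_ZONE, instance=node)
        except Exception as e:
            print(f'Could not delete instance "{node}" from Compute Engine: {e}')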