Calico CNI pod networking not working across different hosts on EKS Kubernetes worker nodes

Date: 2019-09-13 13:58:50

Tags: kubernetes aws-eks project-calico

I am running vanilla EKS Kubernetes at version 1.12.

I am using CNI Genie to allow a custom selection of the CNI that pods use at launch time, and I have installed the standard Calico CNI setup.

With CNI Genie I configured the default CNI to be the AWS CNI (aws-node), and all pods start up as usual and get IPs assigned from my VPC subnets.

I then selectively use Calico as the CNI for some basic pods I want to test with. I am using the default Calico 192.168.0.0/16 CIDR range. Everything works fine if the pods are on the same EKS worker node.
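For reference, the CNI selection with CNI Genie happens per pod via an annotation; a minimal sketch of one of my test pods looks roughly like this (the image is a placeholder, and I'm assuming Genie's standard "cni" annotation key):

apiVersion: v1
kind: Pod
metadata:
  name: hello-node1
  annotations:
    cni: "calico"   # CNI Genie reads this and wires the pod up with Calico instead of the default aws-node CNI
spec:
  containers:
    - name: hello-node1
      image: my-hello-node:latest   # placeholder image that listens on port 8080
      ports:
        - containerPort: 8080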

CoreDNS is also running fine (as long as I let the coredns pods run on the AWS CNI).

However, if the pods land on different worker nodes, networking between them does not work inside the cluster.

I have checked the routing tables on the worker nodes that Calico configures automatically, and they look sensible to me.

Here is my wide pod listing across all namespaces:

NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE   IP                NODE                                       NOMINATED NODE
default       hello-node1-865588ccd7-64p5x               1/1     Running   0          31m   192.168.106.129   ip-10-0-2-31.eu-west-2.compute.internal    <none>
default       hello-node2-dc7bbcb74-gqpwq                1/1     Running   0          17m   192.168.25.193    ip-10-0-3-222.eu-west-2.compute.internal   <none>
kube-system   aws-node-cm2dp                             1/1     Running   0          26m   10.0.3.222        ip-10-0-3-222.eu-west-2.compute.internal   <none>
kube-system   aws-node-vvvww                             1/1     Running   0          31m   10.0.2.31         ip-10-0-2-31.eu-west-2.compute.internal    <none>
kube-system   calico-kube-controllers-56bfccb786-fc2j4   1/1     Running   0          30m   10.0.2.41         ip-10-0-2-31.eu-west-2.compute.internal    <none>
kube-system   calico-node-flmnl                          1/1     Running   0          31m   10.0.2.31         ip-10-0-2-31.eu-west-2.compute.internal    <none>
kube-system   calico-node-hcmqd                          1/1     Running   0          26m   10.0.3.222        ip-10-0-3-222.eu-west-2.compute.internal   <none>
kube-system   coredns-6c64c9f456-g2h9k                   1/1     Running   0          30m   10.0.2.204        ip-10-0-2-31.eu-west-2.compute.internal    <none>
kube-system   coredns-6c64c9f456-g5lhl                   1/1     Running   0          30m   10.0.2.200        ip-10-0-2-31.eu-west-2.compute.internal    <none>
kube-system   genie-plugin-hspts                         1/1     Running   0          26m   10.0.3.222        ip-10-0-3-222.eu-west-2.compute.internal   <none>
kube-system   genie-plugin-vqd2d                         1/1     Running   0          31m   10.0.2.31         ip-10-0-2-31.eu-west-2.compute.internal    <none>
kube-system   kube-proxy-jm7f7                           1/1     Running   0          26m   10.0.3.222        ip-10-0-3-222.eu-west-2.compute.internal   <none>
kube-system   kube-proxy-nnp76                           1/1     Running   0          31m   10.0.2.31         ip-10-0-2-31.eu-west-2.compute.internal    <none>

As you can see, the two hello-node pods are using the Calico CNI.

I have two services for the hello-node pods:

NAME          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
hello-node1   ClusterIP   172.20.90.83    <none>        8081/TCP   43m
hello-node2   ClusterIP   172.20.242.22   <none>        8082/TCP   43m

I have confirmed that if I launch the hello-node pods with the AWS CNI instead, I can ping / curl between them using the cluster service names while they are running on separate hosts.

When I use the Calico CNI as described above, everything stops working.

I only have two EKS worker hosts in this test cluster. Here are the routes on each:

K8s worker 1 routes

[ec2-user@ip-10-0-3-222 ~]$ ip route
default via 10.0.3.1 dev eth0
10.0.3.0/24 dev eth0 proto kernel scope link src 10.0.3.222
169.254.169.254 dev eth0
blackhole 192.168.25.192/26 proto bird
192.168.25.193 dev calia0da7d91dc2 scope link
192.168.106.128/26 via 10.0.2.31 dev tunl0 proto bird onlink

K8s worker 2 routes

[ec2-user@ip-10-0-2-31 ~]$ ip route
default via 10.0.2.1 dev eth0
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.31
10.0.2.41 dev enif4cf9019f11 scope link
10.0.2.200 dev eni412af1a0e55 scope link
10.0.2.204 dev eni04260ebbbe1 scope link
169.254.169.254 dev eth0
192.168.25.192/26 via 10.0.3.222 dev tunl0 proto bird onlink
blackhole 192.168.106.128/26 proto bird
192.168.106.129 dev cali19da7817849 scope link

To me, the route: 192.168.25.192/26 via 10.0.3.222 dev tunl0 proto bird onlink

tells me that traffic from this worker (and its containers/pods) destined for the 192.168.25.192/26 subnet should be sent over the tunl0 interface to 10.0.3.222 (the AWS VPC ENI for that EC2 host).

This route is on EC2 host 10.0.2.31. In other words, when talking from this host's containers to containers in the Calico subnet 192.168.25.192/26, network traffic should route to 10.0.3.222 (the ENI IP of my other EKS worker node, where the containers using Calico run in that subnet).
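To double-check what the kernel actually picks for a given destination, something like the following (run on 10.0.2.31, using the hello-node2 pod IP from the listing above) should confirm the path goes out via tunl0:

[ec2-user@ip-10-0-2-31 ~]$ ip route get 192.168.25.193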

To clarify my testing procedure:

  1. Exec into the hello-node1 pod and curl http://hello-node2:8082 (or ping the Calico-assigned IP address of the hello-node2 pod). A rough sketch is shown below.
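A rough sketch of that test (pod name taken from the listing above, assuming the image has curl and ping available):

kubectl exec -it hello-node1-865588ccd7-64p5x -- curl -v http://hello-node2:8082
kubectl exec -it hello-node1-865588ccd7-64p5x -- ping -c 3 192.168.25.193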

EDIT

To test further, I ran tcpdump on the host running the hello-node2 container, capturing on port 8080 (the port the container listens on).

I do get activity on the destination host where the test container I am curling is running, but it doesn't seem to indicate that traffic is being dropped.

[ec2-user@ip-10-0-3-222 ~]$ sudo tcpdump -vv -x -X -i tunl0 'port 8080'
tcpdump: listening on tunl0, link-type RAW (Raw IP), capture size 262144 bytes
14:32:42.859238 IP (tos 0x0, ttl 254, id 63813, offset 0, flags [DF], proto TCP (6), length 60)
    10.0.2.31.29192 > 192.168.25.193.webcache: Flags [S], cksum 0xf932 (correct), seq 3206263598, win 28000, options [mss 1400,sackOK,TS val 2836614698 ecr 0,nop,wscale 7], length 0
        0x0000:  4500 003c f945 4000 fe06 9ced 0a00 021f  E..<.E@.........
        0x0010:  c0a8 19c1 7208 1f90 bf1b b32e 0000 0000  ....r...........
        0x0020:  a002 6d60 f932 0000 0204 0578 0402 080a  ..m`.2.....x....
        0x0030:  a913 4e2a 0000 0000 0103 0307            ..N*........
14:32:43.870168 IP (tos 0x0, ttl 254, id 63814, offset 0, flags [DF], proto TCP (6), length 60)
    10.0.2.31.29192 > 192.168.25.193.webcache: Flags [S], cksum 0xf53f (correct), seq 3206263598, win 28000, options [mss 1400,sackOK,TS val 2836615709 ecr 0,nop,wscale 7], length 0
        0x0000:  4500 003c f946 4000 fe06 9cec 0a00 021f  E..<.F@.........
        0x0010:  c0a8 19c1 7208 1f90 bf1b b32e 0000 0000  ....r...........
        0x0020:  a002 6d60 f53f 0000 0204 0578 0402 080a  ..m`.?.....x....
        0x0030:  a913 521d 0000 0000 0103 0307            ..R.........
^C
2 packets captured
2 packets received by filter
0 packets dropped by kernel

The calia0da7d91dc2 interface on the host running my target/test container shows increasing RX packet and byte counts whenever I run the curl from the other container on the other host. Traffic is definitely traversing.

[ec2-user@ip-10-0-3-222 ~]$ ifconfig
calia0da7d91dc2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1440
        inet6 fe80::ecee:eeff:feee:eeee  prefixlen 64  scopeid 0x20<link>
        ether ee:ee:ee:ee:ee:ee  txqueuelen 0  (Ethernet)
        RX packets 84  bytes 5088 (4.9 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
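Since the RX counters increase while TX stays at 0, one extra check I can think of (I have not captured this yet) is to watch the pod's veth directly and see whether any replies ever leave the pod:

[ec2-user@ip-10-0-3-222 ~]$ sudo tcpdump -ni calia0da7d91dc2 'port 8080'

If the SYNs show up there but no SYN-ACKs come back, the problem would be on the return path rather than on the way in.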

What is causing networking between hosts to fail here? Am I missing something obvious?

EDIT 2 - Information for Arjun Pandey (parjun8840)

Here is some more information about my Calico configuration:

  • I have disabled source/destination checking on all of the AWS EC2 worker nodes (see the sketch after this list)
  • I have followed the latest Calico docs to configure the IP pool for cross-subnet traffic and NAT for traffic leaving the cluster
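For reference, the source/destination check change can be made per instance with the AWS CLI, roughly like this (the instance ID is a placeholder):

aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --no-source-dest-check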

A note on the calicoctl configuration: it seems that workload endpoints do not exist...

 me@mine ~ aws-vault exec my-vault-entry -- kubectl get IPPool --all-namespaces
NAME                  AGE
default-ipv4-ippool   1d

 me@mine ~ aws-vault exec my-vault-entry -- kubectl get IPPool default-ipv4-ippool -o yaml
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  annotations:
    projectcalico.org/metadata: '{"uid":"41bd2c82-d576-11e9-b1ef-121f3d7b4d4e","creationTimestamp":"2019-09-12T15:59:09Z"}'
  creationTimestamp: "2019-09-12T15:59:09Z"
  generation: 1
  name: default-ipv4-ippool
  resourceVersion: "500448"
  selfLink: /apis/crd.projectcalico.org/v1/ippools/default-ipv4-ippool
  uid: 41bd2c82-d576-11e9-b1ef-121f3d7b4d4e
spec:
  blockSize: 26
  cidr: 192.168.0.0/16
  ipipMode: CrossSubnet
  natOutgoing: true
  nodeSelector: all()
  vxlanMode: Never
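As a quicker sanity check of the same pool (assuming a v3.x calicoctl), the wide output shows the CIDR, NAT and IP-in-IP mode on a single line:

 me@mine ~ aws-vault exec my-vault-entry -- calicoctl get ippool default-ipv4-ippool -o wide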

 me@mine ~ aws-vault exec my-vault-entry -- calicoctl get nodes
NAME
ip-10-254-109-184.ec2.internal
ip-10-254-109-237.ec2.internal
ip-10-254-111-147.ec2.internal

 me@mine ~ aws-vault exec my-vault-entry -- calicoctl get workloadendpoints
WORKLOAD   NODE   NETWORKS   INTERFACE


 me@mine ~
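One caveat on the empty workloadendpoints list above: it could also simply mean calicoctl is not pointed at the same datastore the cluster uses. Assuming the Kubernetes API datastore, making that explicit would look something like:

 me@mine ~ DATASTORE_TYPE=kubernetes KUBECONFIG=~/.kube/config calicoctl get workloadendpoints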

Here is some network information for a sample host in the cluster and the container networking of one of the test containers:

Host ip a:

[ec2-user@ip-10-254-109-184 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 02:1b:79:d1:c5:bc brd ff:ff:ff:ff:ff:ff
    inet 10.254.109.184/26 brd 10.254.109.191 scope global dynamic eth0
       valid_lft 2881sec preferred_lft 2881sec
    inet6 fe80::1b:79ff:fed1:c5bc/64 scope link
       valid_lft forever preferred_lft forever
3: eni808caba7453@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default
    link/ether c2:be:80:d4:6a:f3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::c0be:80ff:fed4:6af3/64 scope link
       valid_lft forever preferred_lft forever
5: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1440 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
    inet 192.168.29.128/32 brd 192.168.29.128 scope global tunl0
       valid_lft forever preferred_lft forever
6: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 02:12:58:bb:c6:1a brd ff:ff:ff:ff:ff:ff
    inet 10.254.109.137/26 brd 10.254.109.191 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::12:58ff:febb:c61a/64 scope link
       valid_lft forever preferred_lft forever
7: enia6f1918d9e2@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default
    link/ether 96:f5:36:53:e9:55 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::94f5:36ff:fe53:e955/64 scope link
       valid_lft forever preferred_lft forever
8: enia32d23ac2d1@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default
    link/ether 36:5e:34:a7:82:30 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::345e:34ff:fea7:8230/64 scope link
       valid_lft forever preferred_lft forever
9: cali5e7dde1e39e@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link
       valid_lft forever preferred_lft forever
[ec2-user@ip-10-254-109-184 ~]$

nsenter on the test container's PID to get the ip a info:

[ec2-user@ip-10-254-109-184 ~]$ sudo nsenter -t 15715 -n ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default
    link/ether 9a:6d:db:06:74:cb brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.29.129/32 scope global eth0
       valid_lft forever preferred_lft forever
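The route table inside the same network namespace may also be relevant (same PID as above):

[ec2-user@ip-10-254-109-184 ~]$ sudo nsenter -t 15715 -n ip route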

1 answer:

Answer 0 (score: 0)

I am not sure of the exact solution right now (I haven't tested Calico on AWS; normally I use amazon-vpc-cni-k8s on AWS and Calico on physical clusters), but below is a quick investigation we can do.

Calico AWS requirements - https://docs.projectcalico.org/v2.3/reference/public-cloud/aws
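
Also, since the pool uses IP-in-IP across subnets, it is worth confirming on the destination worker that the encapsulated packets (IP protocol 4) actually arrive on the primary interface while the curl is running, something like:

sudo tcpdump -ni eth0 ip proto 4

If nothing shows up there, something between the nodes (security groups / NACLs) is likely blocking protocol 4.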

kubectl get IPPool --all-namespaces
NAME                  AGE
default-ipv4-ippool   15d

kubectl get IPPool default-ipv4-ippool -o yaml


~ calicoctl get nodes
NAME            
node1         
node2        
node3 
node4   

~ calicoctl get workloadendpoints

NODE            ORCHESTRATOR   WORKLOAD                                                   NAME    
node2               k8s            default.myapp-569c54f85-xtktk                   eth0       
node1               k8s            kube-system.calico-kube-controllers-5cbcccc885-b9x8s   eth0   
node1               k8s            kube-system.coredns-fb8b8dcde-2zpw8                    eth0   
node1               k8s            kube-system.coredns-fb8b8dcfg-hc6zv                    eth0 

Also, it would help if we can get the details of the container network: nsenter -t pid -n ip a

And the same for the host: ip a
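For completeness, one way to find the PID to feed to nsenter (assuming a Docker runtime; the container ID is a placeholder):

docker ps | grep hello-node2
docker inspect --format '{{.State.Pid}}' <container-id>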