Thanks in advance for any help finding the root cause of this failure. I have set up a Kubernetes 1.11 cluster (1 master, 2 nodes; the master and one node share the same machine) on CentOS 7.5 VMs.
My environment:
$ uname -a
Linux master.home 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/centos-release
CentOS Linux release 7.5.1804 (Core)
$ docker version
Client:
Version: 17.09.1-ce
API version: 1.32
Go version: go1.8.3
Git commit: 19e2cf6
Built: Thu Dec 7 22:23:40 2017
OS/Arch: linux/amd64
Server:
Version: 17.09.1-ce
API version: 1.32 (minimum version 1.12)
Go version: go1.8.3
Git commit: 19e2cf6
Built: Thu Dec 7 22:25:03 2017
OS/Arch: linux/amd64
Experimental: false
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:17:28Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:08:34Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
The node interface IP addresses look fine to me:
$ ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 08:00:27:5b:19:5f brd ff:ff:ff:ff:ff:ff
inet 192.168.1.111/24 brd 192.168.1.255 scope global dynamic enp0s3
valid_lft 82102sec preferred_lft 82102sec
inet6 fe80::a00:27ff:fe5b:195f/64 scope link tentative dadfailed
valid_lft forever preferred_lft forever
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:38:a6:c7:bd brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:38ff:fea6:c7bd/64 scope link
valid_lft forever preferred_lft forever
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether 26:1e:c3:e9:a3:db brd ff:ff:ff:ff:ff:ff
inet 10.150.69.0/32 scope global flannel.1
valid_lft forever preferred_lft forever
inet6 fe80::241e:c3ff:fee9:a3db/64 scope link
valid_lft forever preferred_lft forever
12: vethc0ae215@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
link/ether 9a:1c:9d:21:18:57 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::981c:9dff:fe21:1857/64 scope link
valid_lft forever preferred_lft forever
The etcd cluster is healthy:
$ etcdctl --endpoints=https://192.168.1.111:2379 --cert-file=/var/lib/kubernetes/kubernetes.pem --key-file=/var/lib/kubernetes/kubernetes-key.pem cluster-health
member ca38fd8eb3e17372 is healthy: got healthy result from https://192.168.1.111:2379
cluster is healthy
$ etcdctl --endpoints=https://192.168.1.111:2379 --cert-file=/var/lib/kubernetes/kubernetes.pem --key-file=/var/lib/kubernetes/kubernetes-key.pem get /atomic.io/network/config
{ "Network": "10.150.0.0/16", "SubnetLen": 24, "Backend": {"Type": "vxlan"}}
I also updated iptables:
$ iptables --version
iptables v1.6.2
Overview from kubectl:
$ kubectl get all --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/coredns-55f86bf584-9vz6k 1/1 Running 11 39m
kube-system pod/coredns-55f86bf584-z4nvv 1/1 Running 11 39m
kube-system pod/kube-flannel-ds-amd64-kw972 0/1 CrashLoopBackOff 6 10m
kube-system pod/kube-flannel-ds-amd64-rhv2c 0/1 CrashLoopBackOff 6 10m
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default service/kubernetes ClusterIP 10.32.0.1 <none> 443/TCP 2h
kube-system service/kube-dns ClusterIP 10.32.0.10 <none> 53/UDP,53/TCP 39m
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-system daemonset.apps/kube-flannel-ds-amd64 2 2 0 2 0 beta.kubernetes.io/arch=amd64 10m
kube-system daemonset.apps/kube-flannel-ds-arm 0 0 0 0 0 beta.kubernetes.io/arch=arm 10m
kube-system daemonset.apps/kube-flannel-ds-arm64 0 0 0 0 0 beta.kubernetes.io/arch=arm64 10m
kube-system daemonset.apps/kube-flannel-ds-ppc64le 0 0 0 0 0 beta.kubernetes.io/arch=ppc64le 10m
kube-system daemonset.apps/kube-flannel-ds-s390x 0 0 0 0 0 beta.kubernetes.io/arch=s390x 10m
NAMESPACE NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/coredns 2 2 2 2 39m
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/coredns-55f86bf584 2 2 2 39m
I used this manifest for flannel, changing the "Network" value to "10.150.0.0/16":
$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
And this one for CoreDNS:
$ kubectl apply -f https://storage.googleapis.com/kubernetes-the-hard-way/coredns.yaml
Here 192.168.1.111 is my master node and 10.32.0.1 is the kubernetes service IP. I did not bootstrap this cluster with kubeadm; I followed https://github.com/kelseyhightower/kubernetes-the-hard-way for most of the bootstrapping. I am not sure why I see x509 complaints in the logs of the respective pods:
$ kubectl logs kube-flannel-ds-amd64-kw972 -n kube-system
I1126 14:51:38.415251 1 main.go:475] Determining IP address of default interface
I1126 14:51:38.417393 1 main.go:488] Using interface with name enp0s3 and address 192.168.1.111
I1126 14:51:38.417535 1 main.go:505] Defaulting external address to interface address (192.168.1.111)
E1126 14:51:38.427865 1 main.go:232] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-ds-amd64-kw972': Get https://10.32.0.1:443/api/v1/namespaces/kube-system/pods/kube-flannel-ds-amd64-kw972: x509: certificate is valid for 192.168.1.111, 127.0.0.1, not 10.32.0.1
$ kubectl logs coredns-55f86bf584-z4nvv -n kube-system
E1126 14:50:51.845470 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:355: Failed to list *v1.Namespace: Get https://10.32.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: x509: certificate is valid for 192.168.1.111, 127.0.0.1, not 10.32.0.1
E1126 14:50:51.850446 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:348: Failed to list *v1.Service: Get https://10.32.0.1:443/api/v1/services?limit=500&resourceVersion=0: x509: certificate is valid for 192.168.1.111, 127.0.0.1, not 10.32.0.1
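The log lines themselves spell out the mismatch: the API server's serving certificate lists 192.168.1.111 and 127.0.0.1 as valid names, but not the 10.32.0.1 service IP that in-cluster clients connect to. One way to confirm is to print a certificate's Subject Alternative Names with openssl. The snippet below is a self-contained demo on a throwaway cert (the /tmp paths are my assumption, and -addext/-ext need OpenSSL 1.1.1+); on the real cluster you would run the final command against /var/lib/kubernetes/kubernetes.pem instead:

```shell
# Demo: generate a throwaway cert whose SANs include the service IP
# (the /tmp paths are illustrative, not from the cluster).
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/demo-key.pem -out /tmp/demo.pem -days 1 \
  -subj "/CN=kubernetes" \
  -addext "subjectAltName=IP:192.168.1.111,IP:127.0.0.1,IP:10.32.0.1"

# Print the SANs. On the real cluster, point this at
# /var/lib/kubernetes/kubernetes.pem: 10.32.0.1 must be listed,
# or in-cluster clients fail TLS verification exactly as in the logs above.
openssl x509 -in /tmp/demo.pem -noout -ext subjectAltName
```

If 10.32.0.1 is missing from the real certificate's SAN list, regenerating the API server certificate with the service IP included (as kubernetes-the-hard-way does) is the usual fix.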
I am also not sure whether SNAT is set up correctly:
$ sudo conntrack -L -d 10.32.0.1
tcp 6 17 TIME_WAIT src=192.168.1.111 dst=10.32.0.1 sport=37862 dport=443 src=192.168.1.111 dst=192.168.1.111 sport=6443 dport=37862 [ASSURED] mark=0 use=1
conntrack v1.4.4 (conntrack-tools): 1 flow entries have been shown.
$ sudo iptables -t nat -L KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-MARK-MASQ udp -- !10.150.0.0/16 10.32.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
KUBE-SVC-TCOU7JCQXEZGVUNU udp -- anywhere 10.32.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
KUBE-MARK-MASQ tcp -- !10.150.0.0/16 10.32.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-SVC-ERIFXISQEP7F7OF4 tcp -- anywhere 10.32.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-MARK-MASQ tcp -- !10.150.0.0/16 10.32.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:https
KUBE-SVC-NPX46M4PTMTKRN6Y tcp -- anywhere 10.32.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:https
KUBE-NODEPORTS all -- anywhere anywhere /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
Flannel config:
$ cat /etc/sysconfig/flanneld
# Flanneld configuration options
# etcd url location. Point this to the server where etcd runs
FLANNEL_ETCD_ENDPOINTS="https://192.168.1.111:2379"
# etcd config key. This is the configuration key that flannel queries
# For address range assignment
FLANNEL_ETCD_PREFIX="/atomic.io/network"
# Any additional options that you want to pass
FLANNEL_OPTIONS="-v=9 --etcd-certfile=/var/lib/kubernetes/kubernetes.pem --etcd-keyfile=/var/lib/kubernetes/kubernetes-key.pem --remote-cafile=/var/lib/kubernetes/ca.pem"
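When flannel starts cleanly it records its lease in /run/flannel/subnet.env. Given the config above, I would expect something roughly like the following on this node (the values are my inference: the network from the etcd config, the subnet matching the flannel.1 address 10.150.69.0 shown earlier, and the MTU from the vxlan interface):

```sh
# /run/flannel/subnet.env (illustrative; actual values come from the lease)
FLANNEL_NETWORK=10.150.0.0/16
FLANNEL_SUBNET=10.150.69.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=false
```

If this file is missing or the subnet does not fall inside FLANNEL_NETWORK, the flannel daemon never got a lease from etcd.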
Edit 1: Updated the title to better reflect the underlying problem. My goal is to make sure DNS works as expected in my k8s ecosystem. I tested nslookup with the busybox 1.28 image:
$ kubectl exec -ti busybox -- nslookup kubernetes
Server: 10.32.0.10
Address 1: 10.32.0.10
nslookup: can't resolve 'kubernetes'
command terminated with exit code 1
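For reference, a minimal pod for this kind of DNS test can be created from a manifest like the one below (a sketch; the pod name matches the exec command above, and busybox 1.28 is used because nslookup is known to misbehave in some later busybox images):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: busybox
  namespace: default
spec:
  containers:
  - name: busybox
    image: busybox:1.28
    command: ["sleep", "3600"]
  restartPolicy: Always
```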
Update: After upgrading Docker to 18.06.1-ce and editing the kubelet.service file to use --container-runtime-endpoint=unix:///var/run/docker/containerd/docker-containerd.sock, the x509 errors are gone and CoreDNS is up and running. One step closer, but not there yet.
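For completeness, that kubelet.service change amounts to adding the endpoint flag to the ExecStart line. A sketch of the relevant unit fragment (other flags elided; the binary path and --container-runtime flag are assumptions based on the kubernetes-the-hard-way layout):

```ini
# /etc/systemd/system/kubelet.service (fragment; other flags omitted)
[Service]
ExecStart=/usr/local/bin/kubelet \
  --container-runtime=docker \
  --container-runtime-endpoint=unix:///var/run/docker/containerd/docker-containerd.sock
Restart=on-failure
```

After editing the unit, `systemctl daemon-reload` followed by `systemctl restart kubelet` picks up the new flag.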