"the server doesn't have a resource type "pods"" when installing NVIDIA Clara Deploy

Time: 2020-10-17 21:14:33

Tags: kubernetes nvidia

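I am trying to install the latest version of NVIDIA Clara Deploy Bootstrap by following the official documentation. The first installation step is a shell script named "bootstrap.sh", which installs all dependencies, including Kubernetes and kubectl, and creates the cluster. But when I run sudo ./bootstrap.sh, I get the following error: error: the server doesn't have a resource type "pods"

What I have done so far: I am fairly new to Kubernetes. I tried the solution from a related answer and ran kubectl get pods, which returned No resources found. I also tried kubectl auth can-i get pods, which returned yes. The /etc/kubernetes/manifests directory, which according to that answer should contain the conf files, was empty, so I ran sudo kubeadm init.

Here is the full error message: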

2020-10-17 20:57:37 [INFO]: Clara Deploy SDK System Prerequisites Installation
2020-10-17 20:57:37 [INFO]: Checking user privilege...
 
2020-10-17 20:57:37 [INFO]: Checking for NVIDIA GPU driver...
2020-10-17 20:57:37 [INFO]: NVIDIA CUDA driver version found: 418.87.01
2020-10-17 20:57:37 [INFO]: NVIDIA GPU driver found
2020-10-17 20:57:37 [INFO]: Check and install required packages: apt-transport-https ca-certificates curl software-properties-common network-manager unzip lsb-release dirmngr jq ...
Ign:1 http://deb.debian.org/debian stretch InRelease
Get:2 http://security.debian.org stretch/updates InRelease [53.0 kB]
Get:3 http://deb.debian.org/debian stretch-updates InRelease [93.6 kB]          
Get:4 http://deb.debian.org/debian stretch-backports InRelease [91.8 kB]               
Hit:5 http://deb.debian.org/debian stretch Release 
Hit:6 http://packages.cloud.google.com/apt gcsfuse-stretch InRelease
Get:7 https://download.docker.com/linux/debian stretch InRelease [44.8 kB]
Get:8 http://packages.cloud.google.com/apt cloud-sdk-stretch InRelease [6,389 B]                                       
Get:9 http://security.debian.org stretch/updates/main Sources [263 kB]                            
Hit:10 http://packages.cloud.google.com/apt google-compute-engine-stretch-stable InRelease             
Get:11 http://security.debian.org stretch/updates/main amd64 Packages [604 kB]                                       
Get:12 http://security.debian.org stretch/updates/main Translation-en [267 kB]                                                 
Hit:13 http://packages.cloud.google.com/apt google-cloud-packages-archive-keyring-stretch InRelease                                   
Hit:14 https://nvidia.github.io/libnvidia-container/stable/debian9/amd64  InRelease            
Hit:16 https://nvidia.github.io/nvidia-container-runtime/stable/debian9/amd64  InRelease
Hit:15 https://packages.cloud.google.com/apt kubernetes-xenial InRelease
Hit:18 https://nvidia.github.io/nvidia-docker/debian9/amd64  InRelease
Fetched 1,424 kB in 1s (1,175 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
apt-transport-https is already the newest version (1.4.10).
ca-certificates is already the newest version (20200601~deb9u1).
dirmngr is already the newest version (2.1.18-8~deb9u4).
jq is already the newest version (1.5+dfsg-1.3).
lsb-release is already the newest version (9.20161125).
network-manager is already the newest version (1.6.2-3+deb9u2).
unzip is already the newest version (6.0-21+deb9u2).
curl is already the newest version (7.52.1-5+deb9u12).
software-properties-common is already the newest version (0.96.20.2-1+deb9u1).
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
2020-10-17 20:57:41 [INFO]: Starting network-manager service...
2020-10-17 20:57:41 [INFO]: Successfully installed required packages: apt-transport-https ca-certificates curl software-properties-common network-manager unzip lsb-release dirmngr jq !
2020-10-17 20:57:41 [INFO]: Disabling swap ...
2020-10-17 20:57:41 [INFO]: Start installing docker and nvidia-docker2 ...
2020-10-17 20:57:41 [INFO]: 'proteeti_prova' is already added to docker group. Skipping docker group configuration ...
2020-10-17 20:57:41 [INFO]: Skipping nvidia-docker install since it is already present.
WARNING: No swap limit support
2020-10-17 20:57:42 [INFO]: Docker Compose version 1.25.4 is already installed. Skipping docker-compose installation...
2020-10-17 20:57:42 [INFO]: The following versions of k8s components are already installed.
Error from server (NotFound): the server could not find the requested resource
2020-10-17 20:57:43 [INFO]: - kubectl: Client Version: v1.15.4
2020-10-17 20:57:43 [INFO]: - kubelet: Kubernetes v1.15.4
2020-10-17 20:57:44 [INFO]: - kubeadm: v1.15.4
2020-10-17 20:57:45 [INFO]: Skipping Kubernetes installation (version: 1.15.4-00) since Kubernetes is already present.
error: the server doesn't have a resource type "pods"

1 answer:

Answer 0 (score: 2)

1. Instance:

GCP, Ubuntu 18.04
n1-standard-16 (16 vCPUs, 60 GB memory)
1 x NVIDIA Tesla T4

2. Download the bootstrap archive and unzip it:

$ curl -LO https://api.ngc.nvidia.com/v2/resources/nvidia/clara/clara_bootstrap/versions/0.7.1-2008.1/files/bootstrap.zip
$ unzip bootstrap.zip -d bootstrap

3. Install CUDA first and reboot:

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
$ sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
$ wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
$ sudo dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
$ sudo apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
$ sudo apt-get update
$ sudo apt-get -y install cuda
$ sudo reboot

4. After the reboot, enable IP forwarding:

$ sudo -s
# echo 1 > /proc/sys/net/ipv4/ip_forward
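Writing to /proc as above only lasts until the next reboot, so it can be useful to confirm the setting before each run of bootstrap.sh. The following is a minimal sketch of such a check; the function name ip_forward_enabled and its optional path argument are illustrative and are not part of the Clara scripts.

```shell
# Hypothetical helper: succeeds when IPv4 forwarding is switched on.
# The optional argument only exists so the check can be pointed at another file.
ip_forward_enabled() {
    flag_file="${1:-/proc/sys/net/ipv4/ip_forward}"
    [ "$(cat "$flag_file" 2>/dev/null)" = "1" ]
}

# Example use (not executed here):
#   ip_forward_enabled || echo "run: sudo sysctl -w net.ipv4.ip_forward=1"
```

To keep the setting across reboots, net.ipv4.ip_forward = 1 can be placed in /etc/sysctl.conf or in a file under /etc/sysctl.d/.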

5. Run bootstrap.sh (first attempt)

kubelet.service reports a code=exited, status=255 error:

$ sudo ./bootstrap/bootstrap.sh
...
...
● kubelet.service - kubelet: The Kubernetes Node Agent
       Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
      Drop-In: /etc/systemd/system/kubelet.service.d
               └─10-kubeadm.conf
       Active: activating (auto-restart) (Result: exit-code) since Mon 2020-10-19 10:40:54 UTC; 2s ago
         Docs: https://kubernetes.io/docs/home/
      Process: 2356 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
     Main PID: 2356 (code=exited, status=255)

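This error means you should run kubeadm init manually: the kubelet keeps restarting until kubeadm init has written its configuration. So run kubeadm init --pod-network-cidr=10.244.0.0/16, then check sudo service kubelet status again to make sure it is running as expected. All of the Kubernetes configuration will be generated for you during kubeadm init --pod-network-cidr=10.244.0.0/16.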

6. We add --pod-network-cidr=10.244.0.0/16 because we will use the Flannel CNI. You can check this at line 334 of bootstrap.sh (if ! sudo kubeadm init --pod-network-cidr="10.244.0.0/16"; then):

$ sudo kubeadm init --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.15.12
[preflight] Pulling images required for setting up a Kubernetes cluster
...
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Activating the kubelet service
...
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
...
[apiclient] All control plane components are healthy after 19.501975 seconds
...
Your Kubernetes control-plane has initialized successfully!.
...
$ sudo service kubelet status
    ● kubelet.service - kubelet: The Kubernetes Node Agent
       Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
      Drop-In: /etc/systemd/system/kubelet.service.d
               └─10-kubeadm.conf
       Active: active (running) since Mon 2020-10-19 13:42:22 UTC; 4min 15s ago
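The question mentioned an empty /etc/kubernetes/manifests directory, which is what you would expect before kubeadm init has run. Below is a rough sketch of a check for the files that kubeadm init writes (admin.conf, as shown in the log above, plus the static pod manifests). The function name control_plane_initialized and its directory parameter are assumptions made for illustration.

```shell
# Hypothetical helper: succeeds when the files written by kubeadm init are present.
# The directory defaults to /etc/kubernetes; it is a parameter only to ease testing.
control_plane_initialized() {
    k8s_dir="${1:-/etc/kubernetes}"
    # admin.conf is the kubeconfig that kubectl needs to reach the API server.
    [ -f "$k8s_dir/admin.conf" ] || return 1
    # At least one static pod manifest (kube-apiserver.yaml, etcd.yaml, ...) must exist.
    ls "$k8s_dir"/manifests/*.yaml >/dev/null 2>&1
}

# Example use (not executed here):
#   control_plane_initialized || echo "kubeadm init has not completed on this node"
```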

7. The next step is the usual one, so that you can run kubectl commands as your own user rather than as root:

$ mkdir -p $HOME/.kube
$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config

8. Show everything that is currently installed

$ kubectl get all -A
NAMESPACE     NAME                                READY   STATUS    RESTARTS   AGE
kube-system   pod/coredns-5c98db65d4-cpz4s        0/1     Pending   0          4m17s
kube-system   pod/coredns-5c98db65d4-kgzg8        0/1     Pending   0          4m17s
kube-system   pod/etcd-clara                      1/1     Running   0          3m10s
kube-system   pod/kube-apiserver-clara            1/1     Running   0          3m35s
kube-system   pod/kube-controller-manager-clara   1/1     Running   0          3m17s
kube-system   pod/kube-proxy-8qx4z                1/1     Running   0          4m18s
kube-system   pod/kube-scheduler-clara            1/1     Running   0          3m23s
    
    
NAMESPACE     NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
default       service/kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP                  4m35s
kube-system   service/kube-dns     ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   4m34s
    
NAMESPACE     NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
kube-system   daemonset.apps/kube-proxy   1         1         1       1            1           beta.kubernetes.io/os=linux   4m33s
    
NAMESPACE     NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns   0/2     2            0           4m34s
    
NAMESPACE     NAME                                 DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-5c98db65d4   2         2         0       4m18s

Note: the coredns pods are currently in the Pending state. You can also see that the coredns deployment and replicaset are not ready yet

NAMESPACE     NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns   0/2     2            0           4m34s
    
NAMESPACE     NAME                                 DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-5c98db65d4   2         2         0       4m18s

They stay that way until you apply the Flannel configuration YAML. These are the lines from the same script

info "Deploy kubernetes pod network."
sudo kubectl apply -f $SCRIPT_DIR/kube-flannel.yml
sudo kubectl apply -f $SCRIPT_DIR/kube-flannel-rbac.yml

If you do not do this now and re-run the script, you will hit a timeout error

2020-10-19 14:14:03 [INFO]: coredns pods are not running yet ...
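Before re-running the script you can check the coredns state yourself. The sketch below counts coredns pods that are not yet Running from the output of kubectl get pods -n kube-system; the function name coredns_not_running is made up for this example, and the column positions assume the default output format without -A.

```shell
# Hypothetical helper: reads `kubectl get pods -n kube-system` output on stdin
# and prints how many coredns pods are not in the Running state.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
coredns_not_running() {
    awk '$1 ~ /^coredns-/ && $3 != "Running" { n++ } END { print n + 0 }'
}

# Example use (not executed here):
#   kubectl get pods -n kube-system | coredns_not_running
```

A result of 0 suggests the pod network is in place and the script's wait loop should no longer time out.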

9. Deploy Flannel

$ kubectl apply -f bootstrap/kube-flannel.yml
podsecuritypolicy.extensions/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.extensions/kube-flannel-ds-amd64 created
daemonset.extensions/kube-flannel-ds-arm64 created
daemonset.extensions/kube-flannel-ds-arm created
daemonset.extensions/kube-flannel-ds-ppc64le created
daemonset.extensions/kube-flannel-ds-s390x created
    
$ kubectl apply -f bootstrap/kube-flannel-rbac.yml
clusterrole.rbac.authorization.k8s.io/flannel configured
clusterrolebinding.rbac.authorization.k8s.io/flannel unchanged

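After this, everything related to coredns starts working right away. The pods are created and reach the Running state, and the deployment and replicaset reach the correct state.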

NAMESPACE     NAME                                READY   STATUS    RESTARTS   AGE
kube-system   pod/coredns-5c98db65d4-cpz4s        1/1     Running   0          21m
kube-system   pod/coredns-5c98db65d4-kgzg8        1/1     Running   0          21m
    
NAMESPACE     NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns   2/2     2            2           21m
    
NAMESPACE     NAME                                 DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-5c98db65d4   2         2         2       21m

In addition, you will see the new Flannel-related pod and daemonsets

kube-system   pod/kube-flannel-ds-amd64-64jbv     1/1     Running   0          3m59s
    
    
NAMESPACE     NAME                                     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
kube-system   daemonset.apps/kube-flannel-ds-amd64     1         1         1       1            1           beta.kubernetes.io/arch=amd64     3m59s
kube-system   daemonset.apps/kube-flannel-ds-arm       0         0         0       0            0           beta.kubernetes.io/arch=arm       3m59s
kube-system   daemonset.apps/kube-flannel-ds-arm64     0         0         0       0            0           beta.kubernetes.io/arch=arm64     3m59s
kube-system   daemonset.apps/kube-flannel-ds-ppc64le   0         0         0       0            0           beta.kubernetes.io/arch=ppc64le   3m59s
kube-system   daemonset.apps/kube-flannel-ds-s390x     0         0         0       0            0           beta.kubernetes.io/arch=s390x     3m59s

10. Finally, you can continue and run the script again. It will try (!!!) to install helm and tiller and restart dockerd. Everything works except TILLER...

$ sudo ./bootstrap/bootstrap.sh
[INFO]: Clara Deploy SDK System Prerequisites Installation
...
Skipping Kubernetes installation (version: 1.15.4-00) since Kubernetes is already present.
./bootstrap/bootstrap.sh: line 412: helm: command not found
...
[INFO]: Start installing helm ...
...
[INFO]: Restarting dockerd...
The connection to the server *.*.*.*:6443 was refused - did you specify the right host or port?
[INFO]: Waiting for Kubernetes to be ready...
Kubernetes master is running at https://*.*.*.*:6443
KubeDNS is running at https://*.*.*.*:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
...
[INFO]: Updating permissions...
[INFO]: tiller pod is not started yet ...
[INFO]: tiller pod is not started yet ...
[INFO]: tiller pod is not started yet ...

11. We have no Tiller pod. As a result, the deployment and replicaset are broken as well...

kube-system   deployment.apps/tiller-deploy   0/1  0 0 7m26s
kube-system   replicaset.apps/tiller-deploy-659c6788f5   1 0 0 7m26s

I did not see any solution here other than manually deleting the tiller-related components (deployment, service) and installing them from scratch, so I applied a small workaround.

# delete tiller
$ kubectl delete deployment tiller-deploy -n kube-system
$ kubectl delete service tiller-deploy -n kube-system

# install helm, tiller
$ curl https://raw.githubusercontent.com/helm/helm/master/scripts/get | bash
$ kubectl create serviceaccount --namespace kube-system tiller
$ kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
$ helm init --service-account tiller

If you now check what has been deployed, you will clearly see that the tiller pod is Pending, and likewise that the tiller-deploy deployment is not ready

NAMESPACE     NAME                                 READY   STATUS    RESTARTS   AGE
kube-system   pod/tiller-deploy-67847cd9b9-vlzm6   0/1     Pending   0          11m
    
NAMESPACE     NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/tiller-deploy   0/1     1            0           11m
    
NAMESPACE     NAME                                       DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/tiller-deploy-67847cd9b9   1         1         0       11m

12. Fix tiller

Let's describe the tiller pod and look at its tolerations

$ kubectl describe pod tiller-deploy-67847cd9b9-vlzm6 -n kube-system
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                     node.kubernetes.io/unreachable:NoExecute for 300s

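I won't explain the reason (you can read up on tolerations yourself), but the workaround is to allow the master node to run pods...

$ kubectl taint nodes --all node-role.kubernetes.io/master-

After that you will see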

NAMESPACE     NAME                                 READY   STATUS    RESTARTS   AGE
kube-system   pod/tiller-deploy-67847cd9b9-vlzm6   1/1     Running   0          13m
    
NAMESPACE     NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/tiller-deploy   1/1     1            1           13m
    
NAMESPACE     NAME                                       DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/tiller-deploy-67847cd9b9   1         1         1       13m
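For context: on a kubeadm cluster of this version the control-plane node normally carries the taint node-role.kubernetes.io/master:NoSchedule, and the tolerations listed above do not cover it, which would explain why the pod stays Pending on a single-node cluster. The following sketch checks for that taint in the output of kubectl describe nodes; the function name has_master_taint is hypothetical.

```shell
# Hypothetical helper: reads `kubectl describe nodes` output on stdin and
# succeeds if any node still carries the master NoSchedule taint.
has_master_taint() {
    # -F matches the string literally, so the dots are not treated as wildcards.
    grep -qF 'node-role.kubernetes.io/master:NoSchedule'
}

# Example use (not executed here):
#   kubectl describe nodes | has_master_taint && echo "master is still tainted"
```

Removing the taint is reasonable for a single-node test setup like this one; on a multi-node cluster you would usually leave it in place.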

13. Next, install all the components:

$ curl -LO https://api.ngc.nvidia.com/v2/resources/nvidia/clara/clara_cli/versions/0.7.1-2008.1/files/cli.zip
$ sudo unzip cli.zip -d /usr/bin/ && sudo chmod 755 /usr/bin/clara*
    
$ clara version
Clara CLI version: 0.7.1-12788.ae65aea0
$ clara config --key KEY --orgteam nvidia/clara -y
Configuration "ngc-clara"successfully created
    
$ clara pull platform
Clara Platform 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara
    
$ clara platform start
Starting clara...
NAME:   clara
    
$ clara pull dicom
Clara Dicom Adapter 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/dicom-adapter
    
$ clara pull render
Clara Renderer 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-renderer
    
$ clara pull monitor
Clara Monitor Server 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-monitor-server
    
$ clara pull console
Clara Management Console 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-console
    
$ clara dicom start
Starting DICOM Adapter...
NAME: clara-dicom-adapter
$ clara render start
NAME: clara-render-server
$ clara monitor start
NAME: clara-monitor-server
$ clara console start
NAME: clara-console

14. To verify that the installation succeeded, run the following commands:

$ helm ls
NAME                    REVISION        UPDATED                         STATUS          CHART                                   APP VERSION     NAMESPACE
clara                   1               Mon Oct 19 16:16:36 2020        DEPLOYED        clara-0.7.1-2008.1                      1.0             default  
clara-console           1               Mon Oct 19 16:28:30 2020        DEPLOYED        clara-console-0.7.1-2008.1              1.0             default  
clara-dicom-adapter     1               Mon Oct 19 16:22:36 2020        DEPLOYED        dicom-adapter-0.7.1-2008.1              1.0             default  
clara-monitor-server    1               Mon Oct 19 16:26:35 2020        DEPLOYED        clara-monitor-server-0.7.1-2008.1       1.0             default  
clara-render-server     1               Mon Oct 19 16:22:54 2020        DEPLOYED        clara-renderer-0.7.1-2008.1             1.0             default  
    
    
$ kubectl get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
clara-clara-platformapiserver-54c5c44bbd-gqdd6         1/1     Running   0          13m
clara-console-8565b4d565-wcbg5                         2/2     Running   0          2m2s
clara-console-mongodb-85f8bd5f95-ts2gp                 1/1     Running   0          2m2s
clara-dicom-adapter-7948fcd445-mnsjd                   1/1     Running   0          7m56s
clara-monitor-server-fluentd-elasticsearch-6zvhq       1/1     Running   0          3m57s
clara-monitor-server-grafana-5f874b974d-6l4s8          1/1     Running   0          3m57s
clara-monitor-server-monitor-server-59c8bf68f7-5dgxq   1/1     Running   0          3m57s
clara-render-server-clara-renderer-d79dd4779-wcjrv     3/3     Running   0          7m38s
clara-resultsservice-664477898f-9nk4f                  1/1     Running   0          13m
clara-ui-6f89b97df8-792f6                              1/1     Running   0          13m
clara-workflow-controller-69cbb55fc8-zjhdm             1/1     Running   0          13m
elasticsearch-master-0                                 1/1     Running   0          3m57s
elasticsearch-master-1                                 1/1     Running   0          3m57s
fluentd-km8nj                                          1/1     Running   0          13m
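Scanning a long pod list by eye is error-prone, so here is a small sketch that lists any pod whose READY count is incomplete or whose STATUS is not Running. The function name pods_not_ready is an assumption for this example, and it expects the default kubectl get pods columns shown above.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints the
# name of every pod that is not fully ready or not in the Running state.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
pods_not_ready() {
    awk 'NR > 1 {
        split($2, ready, "/")          # READY looks like "2/2"
        if (ready[1] != ready[2] || $3 != "Running") print $1
    }'
}

# Example use (not executed here):
#   kubectl get pods | pods_not_ready
```

Empty output means every listed pod is Running with all containers ready. Note that pods belonging to finished jobs (STATUS Completed) would also be listed by this sketch.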

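P.S. Of course, it would have been much easier to just fix the script for you, but I decided to show you what is happening behind the scenes. I'm sure you can do that yourself if you need to.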