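Today I updated the GPU driver on the DGX Station in our lab. I followed this guide, which uses the GUI method to update everything (including Docker).
The GPU driver was successfully upgraded from 384.125 to 430.64, but Docker was somehow broken afterwards. I decided to remove the broken Docker installation and reinstall everything myself.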
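I then followed this guide, which explains how to remove old versions of Docker and install a new one. The test command
sudo docker run hello-world
ran without problems, which indicates that Docker is installed correctly.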
To use the GPUs inside Docker, I followed this guide to install nvidia-container-runtime. Everything worked: I can create new containers and use the GPUs inside them.
However, when I try to use my previous containers with
docker start old_containers
I get the following error
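Error response from daemon: get nvidia_driver_384.125: error while checking if volume "nvidia_driver_384.125" exists in driver "nvidia-docker": error looking up volume plugin nvidia-docker: plugin "nvidia-docker" not found
Error: failed to start containers: old_containers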
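The error names two things the daemon could not resolve: a named volume and the volume plugin that is supposed to provide it. As a minimal sketch (not part of my original attempts), they can be pulled out of the message with sed. The variable err and the helpers missing_volume and missing_plugin are hypothetical names, and the message text is assumed to be the usual English wording of this Docker error.

```shell
# Assumed English wording of the daemon error for the old container.
err='Error response from daemon: get nvidia_driver_384.125: error while checking if volume "nvidia_driver_384.125" exists in driver "nvidia-docker": error looking up volume plugin nvidia-docker: plugin "nvidia-docker" not found'

# Hypothetical helper: print the volume name the daemon was looking for.
missing_volume() {
    printf '%s\n' "$1" | sed -n 's/.*volume "\([^"]*\)" exists in driver.*/\1/p'
}

# Hypothetical helper: print the volume plugin that could not be found.
missing_plugin() {
    printf '%s\n' "$1" | sed -n 's/.*plugin "\([^"]*\)" not found.*/\1/p'
}

missing_volume "$err"
missing_plugin "$err"
```

The volume name embeds the old driver version (384.125), so the old containers appear to be tied to both the removed plugin and the previous driver.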
According to this GitHub post, I need to run
sudo service nvidia-docker start
or
sudo nvidia-docker-plugin
Both of them give me a "command not found" error.
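A quick way to confirm which of the relevant binaries are actually on the PATH is sketched below. missing_commands is a hypothetical helper name; nvidia-docker and nvidia-docker-plugin are the binaries that shipped with nvidia-docker 1.0, and nvidia-container-runtime is the runtime installed above.

```shell
# Hypothetical helper: print every given command that is NOT found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Check the nvidia-docker 1.0 binaries and the newer runtime.
missing_commands nvidia-docker nvidia-docker-plugin nvidia-container-runtime
```

Any name printed here is not installed (or not on the PATH), which would match the "command not found" errors.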
So I guessed that I am missing the nvidia-docker plugin and tried to install it by following this guide. The command
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
gives me the following error
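The following packages have unmet dependencies:
 nvidia-container-toolkit : Depends: libnvidia-container-tools (>= 1.2.0) but 1.0.1-1 is to be installed
E: Unable to correct problems, you have held broken packages.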
So I tried
sudo apt install libnvidia-container-tools=1.2.0
but it seems that version 1.2.0 is not available
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Version '1.2.0' for 'libnvidia-container-tools' was not found
Running
sudo apt install libnvidia-container-tools
gives me
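Reading package lists... Done
Building dependency tree
Reading state information... Done
libnvidia-container-tools is already the newest version (1.0.1-1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.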
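The apt output above boils down to a version comparison: the toolkit requires libnvidia-container-tools >= 1.2.0, while the newest version my configured repositories offer is 1.0.1-1. A minimal sketch of that check follows. version_ge is a hypothetical helper name; it uses dpkg --compare-versions when dpkg is present and otherwise falls back to GNU sort -V, which only approximates Debian version ordering.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
version_ge() {
    if command -v dpkg >/dev/null 2>&1; then
        # Authoritative Debian version comparison.
        dpkg --compare-versions "$1" ge "$2"
    else
        # Fallback: with a version sort, $2 comes first when $1 >= $2.
        [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
    fi
}

# Installed candidate versus the version the toolkit depends on.
if version_ge 1.0.1-1 1.2.0; then
    echo "dependency satisfied"
else
    echo "dependency not satisfied"
fi
```

Running apt-cache policy libnvidia-container-tools should list the candidate version and the repositories it comes from, which may help show whether a newer package is reachable at all from the current sources.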
nvidia-smi
shows
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-DGXS...  Off  | 00000000:07:00.0  On |                    0 |
| N/A   40C    P0    39W / 300W |    355MiB / 32505MiB |     16%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS...  Off  | 00000000:08:00.0 Off |                    0 |
| N/A   39C    P0    39W / 300W |      0MiB / 32508MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-DGXS...  Off  | 00000000:0E:00.0 Off |                    0 |
| N/A   38C    P0    40W / 300W |      0MiB / 32508MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-DGXS...  Off  | 00000000:0F:00.0 Off |                    0 |
| N/A   39C    P0    38W / 300W |      0MiB / 32508MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1585      G   /usr/lib/xorg/Xorg                           179MiB |
|    0      2994      G   compiz                                       163MiB |
|    0     11416      G   /usr/lib/firefox/firefox                      10MiB |
+-----------------------------------------------------------------------------+
nvcc --version
on the base system shows
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
dpkg -l | grep -i docker
on the base system shows
ii dgx-docker-cleanup 1.0-1 amd64 DGX Docker cleanup script
rc dgx-docker-options 1.0-7 amd64 DGX docker daemon options
ii dgx-docker-repo 1.0-1 amd64 docker repository configuration file
ii docker-ce 5:19.03.12~3-0~ubuntu-xenial amd64 Docker: the open-source application container engine
ii docker-ce-cli 5:19.03.12~3-0~ubuntu-xenial amd64 Docker CLI: the open-source application container engine
ii nvidia-container-runtime 2.0.0+docker18.09.2-1 amd64 NVIDIA container runtime
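In this listing the first column is the dpkg status: "ii" means installed, while "rc" means the package was removed but its configuration files remain. A small sketch for picking out anything that is not fully installed is shown below; not_fully_installed is a hypothetical helper name, and the sample lines are copied from the listing above.

```shell
# Hypothetical helper: read `dpkg -l` style lines on stdin and print the
# status and name of every package whose status is not "ii".
not_fully_installed() {
    awk '$1 != "ii" { print $1, $2 }'
}

# Sample input taken from the listing above.
not_fully_installed <<'EOF'
ii dgx-docker-cleanup 1.0-1 amd64 DGX Docker cleanup script
rc dgx-docker-options 1.0-7 amd64 DGX docker daemon options
ii docker-ce 5:19.03.12~3-0~ubuntu-xenial amd64 Docker: the open-source application container engine
EOF
```

On the real system the same filter could be fed with dpkg -l | grep -i docker | not_fully_installed. Here it singles out dgx-docker-options, the package that carried the DGX docker daemon options.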
docker version
shows
Client: Docker Engine - Community
 Version:           19.03.12
 API version:       1.40
 Go version:        go1.13.10
 Git commit:        48a66213fe
 Built:             Mon Jun 22 15:45:49 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.12
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.10
  Git commit:       48a66213fe
  Built:            Mon Jun 22 15:44:20 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.2
  GitCommit:        9754871865f7fe2f4e74d43e2fc7ccd237edcbce
 runc:
  Version:          1.0.0-rc6+dev
  GitCommit:        09c8266bf2fcf9519a651b04ae54c967b9ab86ec
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
lsb_release -a
shows
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.6 LTS
Release: 16.04
Codename: xenial