When restarting the instance on Tuesday, I first ran into the problem of losing GPU support on an AWS p2.xlarge machine with the Ubuntu Deep Learning AMI.
I have now tested this three times within two days, and a colleague ran into the same problem, so I assume it is an AWS bug. Still, maybe someone has an idea how to debug this better.
Basically, after shutting down and restarting, the instance no longer loads the nvidia module into the kernel. Moreover, according to dmesg, a different kernel seems to be loaded. All of this happens without me actively causing it.
Here are the steps to reproduce the problem with a fresh instance and no custom code. I am working in Ireland (eu-west-1), and the instance was launched in availability zone eu-west-1a:
ubuntu@...:~$ lsmod | grep nvidia
nvidia 16592896 0
ipmi_msghandler 49152 1 nvidia
dmesg | less
...
[ 0.000000] Linux version 4.4.0-1075-aws (buildd@lgw01-amd64-035) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #85-Ubuntu SMP Thu Jan 17 17:15:12 UTC 2019 (Ubuntu 4.4.0-1075.85-aws 4.4.167)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-1075-aws root=UUID=96950bba-70e8-4a4b-9d78-d2bc1c767e04 ro console=tty1 console=ttyS0 nvme.io_timeout=4294967295
...
ubuntu@...:~$ nvidia-smi
Tue Mar 19 16:41:53 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   42C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
ubuntu@...:~$ sudo shutdown now
(the instance is then started again from the EC2 console)
ubuntu@...:~$ lsmod | grep nvidia
(no output)
dmesg | less
...
[ 0.000000] Linux version 4.4.0-1077-aws (buildd@lcy01-amd64-021) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #87-Ubuntu SMP Wed Mar 6 00:03:05 UTC 2019 (Ubuntu 4.4.0-1077.87-aws 4.4.170)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-1077-aws root=UUID=96950bba-70e8-4a4b-9d78-d2bc1c767e04 ro console=tty1 console=ttyS0 nvme.io_timeout=4294967295
...
ubuntu@...:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
How can I force the 4.4.0-1075-aws kernel to be booted? Since this is hvm virtualization, there is no dialog in which the kernel can be selected directly.
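(For context: on a stock Ubuntu system the usual way to pin a kernel is via GRUB's default entry. The sketch below is untested on this AMI; the exact menuentry title is an assumption and has to be checked against /boot/grub/grub.cfg first, and cloud images may overwrite GRUB settings on updates.)
grep "menuentry '" /boot/grub/grub.cfg     # check the exact title of the 4.4.0-1075-aws entry first
sudo sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 4.4.0-1075-aws"/' /etc/default/grub
sudo update-grub
sudo reboot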
Answer 0 (score: 7)
There seems to be a problem building the older NVIDIA driver against the 4.4.0-107x-aws kernels. You can install a newer NVIDIA driver, which should work with the current kernel:
wget http://us.download.nvidia.com/tesla/410.104/NVIDIA-Linux-x86_64-410.104.run
sudo sh ./NVIDIA-Linux-x86_64-410.104.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd
According to an AWS representative, the drivers in the Deep Learning AMI were updated on March 21, 2019 [AWS forums].
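Afterwards, a quick sanity check (a sketch; the DKMS entry name and exact output depend on how the installer registered the driver) can confirm that the module was built for and loaded into the running kernel:
dkms status             # the nvidia module should be listed as installed for the running kernel
lsmod | grep nvidia     # the module should now be loaded
nvidia-smi              # should print the usual table again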
Answer 1 (score: 3)
I experienced the same issue, and the following helped me:
sudo apt-get install nvidia-cuda-toolkit
sudo reboot
Good luck!
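After the reboot, the installed versions can be checked (standard locations; a sketch, exact versions will differ):
nvcc --version                      # toolkit version installed by the apt package
cat /proc/driver/nvidia/version     # kernel driver version, only present once the module is loaded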