NVIDIA driver stopped working on an AWS EC2 instance with Ubuntu 16.04 and a Tesla K80 GPU

Date: 2019-03-20 13:24:28

Tags: amazon-web-services tensorflow amazon-ec2 gpu nvidia

For some time I have been running TensorFlow code on an AWS EC2 instance with a Tesla K80 GPU. I have CUDA 9.0 and cuDNN 7.1.4 installed and I am using TF 1.12, all on Ubuntu 16.04.

Everything worked fine until yesterday, but today the NVIDIA driver seems to have stopped working for some reason:

ubuntu@ip-10-0-0-13:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I checked the installed driver packages:

ubuntu@ip-10-0-0-13:~$ dpkg -l | grep nvidia
rc  nvidia-367                              367.48-0ubuntu1                            amd64        NVIDIA binary driver - version 367.48
ii  nvidia-396                              396.37-0ubuntu1                            amd64        NVIDIA binary driver - version 396.37
ii  nvidia-396-dev                          396.37-0ubuntu1                            amd64        NVIDIA binary Xorg driver development files
ii  nvidia-machine-learning-repo-ubuntu1604 1.0.0-1                                    amd64        nvidia-machine-learning repository configuration files
ii  nvidia-modprobe                         396.37-0ubuntu1                            amd64        Load the NVIDIA kernel driver and create device files
rc  nvidia-opencl-icd-367                   367.48-0ubuntu1                            amd64        NVIDIA OpenCL ICD
ii  nvidia-opencl-icd-396                   396.37-0ubuntu1                            amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                            0.8.2                                      amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                         396.37-0ubuntu1                            amd64        Tool for configuring the NVIDIA graphics driver

There seem to be two different versions listed. Could that be the problem? (But then I don't understand why everything worked before.)
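A note on reading that listing: the first column of `dpkg -l` is the package state. `ii` means the package is installed, while `rc` means it was removed and only its configuration files remain. Read that way, only the 396.37 driver is actually installed and the 367 entries are leftovers. A minimal sketch for separating the two states follows; `list_by_state` is a helper name made up for this example.

```shell
#!/bin/sh
# list_by_state STATE: read `dpkg -l` output on stdin and print the names
# of the packages whose state column equals STATE (for example ii or rc).
# The function name is hypothetical, chosen for this example.
list_by_state() {
    awk -v state="$1" '$1 == state { print $2 }'
}

# Typical use on a real machine:
#   dpkg -l | grep nvidia | list_by_state ii   # currently installed
#   dpkg -l | grep nvidia | list_by_state rc   # removed, config files remain
```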

After finding this thread, I checked my kernel, which is clearly different from the one mentioned there:

ubuntu@ip-10-0-0-13:~$ uname -a
Linux ip-10-0-0-13 4.4.0-143-generic #169-Ubuntu SMP Thu Feb 7 07:56:38 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

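Has anyone run into this problem and knows how to fix it? Thanks in advance for your help!

Edit:

When I try to upgrade the driver using @Dehydrated_Mud's method, I get the following error: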

ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.

And the contents of the log file:

nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Thu Mar 21 10:56:46 2019
installer version: 384.183

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

nvidia-installer command line:
    ./nvidia-installer
    --no-drm
    --disable-nouveau
    --dkms
    --silent
    --install-libglvnd

Using built-in stream user interface
-> Detected 4 CPUs online; setting concurrency level to 4.
-> Installing NVIDIA driver version 384.183.
-> The NVIDIA driver appears to have been installed previously using a different installer. To prevent potential conflicts, it is recommended either to update the existing installation using the same mechanism by which it was originally installed, or to uninstall the existing installation before installing this driver.

Please review the message provided by the maintainer of this alternate installation method and decide how to proceed:

The package that is already installed is named nvidia-396.

You can upgrade the driver by running:
`apt-get install nvidia-396 nvidia-modprobe nvidia-settings`

You can remove nvidia-396, and all related packages, by running:
`apt-get remove --purge nvidia-396 nvidia-modprobe nvidia-settings`

This package is maintained by NVIDIA (cudatools@nvidia.com).


(Answer: Abort installation)
ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.
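The log says the .run installer refuses to proceed while the apt-managed nvidia-396 package is present. One possible way forward, sketched here under assumptions, is to purge the packages named in the log and then rerun the installer. The installer file name is assumed to match the logged version 384.183, and `run` and `DRY_RUN` are names invented for this example. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: remove the apt-managed driver named in the installer log, then
# rerun the .run installer. Commands are only printed unless DRY_RUN=0.
# `run` and DRY_RUN are hypothetical names invented for this example.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"      # dry run: show the command instead of executing it
    else
        "$@"
    fi
}

# Package names are the ones suggested in the log message.
run sudo apt-get remove --purge -y nvidia-396 nvidia-modprobe nvidia-settings

# Flags are the ones shown in the logged command line; the file name is an
# assumption based on the logged installer version.
run sudo sh ./NVIDIA-Linux-x86_64-384.183.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd
```

Run it once as is to review the printed commands, then with `DRY_RUN=0` to execute them.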

Running apt-cache search nvidia | grep -P '^nvidia-[0-9]+\s' gives:

nvidia-331 - Transitional package for nvidia-331
nvidia-346 - Transitional package for nvidia-346
nvidia-304 - NVIDIA legacy binary driver - version 304.135
nvidia-340 - NVIDIA binary driver - version 340.107
nvidia-361 - Transitional package for nvidia-367
nvidia-352 - Transitional package for nvidia-375
nvidia-367 - Transitional package for nvidia-387
nvidia-375 - Transitional package for nvidia-418
nvidia-387 - NVIDIA binary driver - version 387.26
nvidia-418 - NVIDIA binary driver - version 418.39
nvidia-384 - NVIDIA binary driver - version 384.183
nvidia-390 - NVIDIA binary driver - version 390.116
nvidia-410 - NVIDIA binary driver - version 410.104
nvidia-396 - NVIDIA binary driver - version 396.82

5 answers:

Answer 0 (score: 5)

I solved this by updating to the latest NVIDIA driver. Use:

nvcc --version

to get the CUDA toolkit version. At the time of writing, the latest driver for CUDA 9.0 is 384.183, and for CUDA 10.0 it is 410.104.
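The lookup can be scripted. This is only a sketch: it assumes `nvcc --version` prints a line containing `release X.Y`, and it knows only the two CUDA-to-driver pairs quoted in this answer. `cuda_release` and `driver_for_cuda` are names made up for the example.

```shell
#!/bin/sh
# cuda_release: read `nvcc --version` output on stdin and print the
# release number (for example 9.0). Assumes a "release X.Y" token.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# driver_for_cuda RELEASE: print the driver version quoted in this answer
# for the given CUDA release. Only the two pairs from the answer are known;
# anything else is reported as unknown rather than guessed.
driver_for_cuda() {
    case "$1" in
        9.0)  echo "384.183" ;;
        10.0) echo "410.104" ;;
        *)    echo "unknown CUDA release: $1" >&2; return 1 ;;
    esac
}

# Typical use on a machine with the toolkit installed:
#   driver_for_cuda "$(nvcc --version | cuda_release)"
```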

Then run:

wget http://us.download.nvidia.com/tesla/384.183/NVIDIA-Linux-x86_64-384.183.run

to download the driver.

Then run:

sudo sh ./NVIDIA-Linux-x86_64-384.183.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd

to install the driver.

Run:

nvidia-smi

to check whether the problem is resolved.

Answer 1 (score: 0)

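#!/bin/bash
# Download and install the NVIDIA Tesla driver version given as the first argument.

set -x

# Abort with a usage message if no version was passed.
version=${1:?usage: install.sh <driver-version>}
#version=410.79
#version=410.104

wget "http://us.download.nvidia.com/tesla/${version}/NVIDIA-Linux-x86_64-${version}.run"
sudo sh "./NVIDIA-Linux-x86_64-${version}.run" --no-drm --disable-nouveau --dkms --silent --install-libglvnd

  1. Save the above as install.sh
  2. sh install.sh 410.104
  3. sudo modprobe nvidia

The GPU should come back right away; check with nvidia-smi.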

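Answer 2 (score: 0)

Reinstalling the driver gets it working again, but it does not fix the underlying problem and is not the right answer to this question. I have seen the same issue on Ubuntu: reinstalling the driver is a workaround that lasts only until the next breakage. The cause of these spontaneous NVIDIA CUDA driver failures is Ubuntu's automatic security updates. If an update replaces the kernel, it breaks the CUDA driver, and nvidia-smi can no longer communicate with it. A simple fix is to disable automatic security updates: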

sudo apt -y remove unattended-upgrades
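To check whether a kernel update is really what broke the driver, one option is to ask DKMS whether an nvidia module is registered for the running kernel. The sketch below assumes the driver was installed with DKMS support and that each line of `dkms status` contains the module name, the kernel release, and the word `installed`. `has_module_for_kernel` is a helper name made up for this example.

```shell
#!/bin/sh
# has_module_for_kernel KERNEL: read `dkms status` output on stdin and
# succeed if an nvidia module is reported as installed for KERNEL.
# Assumes each status line names the module, the kernel release and the
# state; the function name is hypothetical.
has_module_for_kernel() {
    grep -i 'nvidia' | grep -F -- "$1" | grep -q 'installed'
}

# Typical use on a real machine:
#   dkms status | has_module_for_kernel "$(uname -r)" \
#       || echo "no nvidia module built for $(uname -r)"
```

If the check fails right after the kernel version changed, that is consistent with the explanation above.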

Answer 3 (score: 0)

This worked for me:

sudo apt purge nvidia-driver-450
sudo apt autoremove

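Answer 4 (score: -1)

For a multi-CUDA installation, choose the CUDA versions you want to use, then install them in order from oldest to newest. For CUDA 9.0 the latest driver is 384.183, for 9.1 it is 390.116, and for CUDA 10.0 it is 410.104.

You can find the names on the following site, but do not use the .deb files.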

https://www.nvidia.com/Download/Find.aspx

cd /usr/local
sudo rm cuda
sudo ln -s "cuda-${cuda_version}" cuda

wget "http://us.download.nvidia.com/tesla/${nvidia_version}/NVIDIA-Linux-x86_64-${nvidia_version}.run"
sudo sh "./NVIDIA-Linux-x86_64-${nvidia_version}.run" --no-drm --disable-nouveau --dkms --silent --install-libglvnd
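The symlink switch can be wrapped in a small function. This is a sketch, not the exact steps above: it uses `ln -sfn` to replace the link in one step instead of `rm` followed by `ln -s`, and it assumes the toolkits live in directories named `cuda-<version>` under one prefix. `switch_cuda` is a name made up for this example.

```shell
#!/bin/sh
# switch_cuda PREFIX VERSION: repoint PREFIX/cuda at PREFIX/cuda-VERSION.
# Hypothetical helper; assumes toolkits are installed as PREFIX/cuda-<version>.
switch_cuda() {
    prefix=$1
    version=$2
    target="$prefix/cuda-$version"
    if [ ! -d "$target" ]; then
        echo "no such toolkit directory: $target" >&2
        return 1
    fi
    # -f replaces an existing link; -n keeps ln from following the old
    # link into the directory it points to.
    ln -sfn "cuda-$version" "$prefix/cuda"
}

# Typical use (writing to /usr/local needs root, so run the script as root):
#   switch_cuda /usr/local 9.0
```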