I am running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS. I would like to observe GPU usage while training TensorFlow models, so I tried running `nvidia-smi`:
ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
nvidia-bug-report.sh nvidia-debugdump nvidia-xconfig
nvidia-cuda-mps-control nvidia-persistenced
nvidia-cuda-mps-server nvidia-smi
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ dpkg -l | grep nvidia
ii nvidia-346 352.63-0ubuntu0.14.04.1 amd64 Transitional package for nvidia-346
ii nvidia-346-dev 346.46-0ubuntu1 amd64 NVIDIA binary Xorg driver development files
ii nvidia-346-uvm 346.96-0ubuntu0.0.1 amd64 Transitional package for nvidia-346
ii nvidia-352 375.26-0ubuntu1 amd64 Transitional package for nvidia-375
ii nvidia-375 375.39-0ubuntu0.14.04.1 amd64 NVIDIA binary driver - version 375.39
ii nvidia-375-dev 375.39-0ubuntu0.14.04.1 amd64 NVIDIA binary Xorg driver development files
ii nvidia-modprobe 375.26-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
ii nvidia-opencl-icd-346 352.63-0ubuntu0.14.04.1 amd64 Transitional package for nvidia-opencl-icd-352
ii nvidia-opencl-icd-352 375.26-0ubuntu1 amd64 Transitional package for nvidia-opencl-icd-375
ii nvidia-opencl-icd-375 375.39-0ubuntu0.14.04.1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.6.2.1 amd64 Tools to enable NVIDIA's Prime
ii nvidia-settings 375.26-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$
$ inxi -G
Graphics: Card-1: Cirrus Logic GD 5446
Card-2: NVIDIA GK104GL [GRID K520]
X.org: 1.15.1 driver: N/A tty size: 80x24 Advanced Data: N/A out of X
$ lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
Subsystem: XenSource, Inc. Device 0001
Kernel driver in use: cirrus
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
Subsystem: NVIDIA Corporation Device 1014
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
I had installed CUDA 7 and cuDNN by following these instructions:
$sudo apt-get -q2 update
$sudo apt-get upgrade
$sudo reboot
========================================================================
After the reboot, run `$ sudo update-initramfs -u` to update the initramfs. Next, edit the /etc/modprobe.d/blacklist.conf file to blacklist the nouveau driver: open the file in an editor and insert the following lines at the end of the file.
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
Save and exit the file.
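After the next reboot you can confirm that the blacklist took effect by checking that the nouveau module is no longer loaded (no output means it is gone):
$ lsmod | grep nouveau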
Now install the build-essential tools, then update the initramfs and reboot, as follows:
$sudo apt-get install linux-{headers,image,image-extra}-$(uname -r) build-essential
$sudo update-initramfs -u
$sudo reboot
========================================================================
After the reboot, run the following commands to install the NVIDIA driver and CUDA:
$sudo wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
$sudo chmod 700 ./cuda_7.0.28_linux.run
$sudo ./cuda_7.0.28_linux.run
$sudo update-initramfs -u
$sudo reboot
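The runfile installer normally places CUDA under /usr/local/cuda-7.0, and you usually also need to put it on your PATH and library path (the paths below assume the default install location):
$ export PATH=/usr/local/cuda-7.0/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib64:$LD_LIBRARY_PATH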
========================================================================
Now that the system is up, verify the installation by running the following commands:
$sudo modprobe nvidia
$sudo nvidia-smi -q | head
You should see output similar to the nvidia.png screenshot from those instructions.
Now run the following commands:
$ cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
$make
$./deviceQuery
However, `nvidia-smi` still showed no GPU activity while TensorFlow was training a model:
ubuntu@ip-10-0-1-48:~$ ipython
Python 2.7.11 |Anaconda custom (64-bit)| (default, Dec 6 2015, 18:08:32)
Type "copyright", "credits" or "license" for more information.
IPython 4.1.2 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally
ubuntu@ip-10-0-1-48:~$ nvidia-smi
Thu Mar 30 05:45:26 2017
+------------------------------------------------------+
| NVIDIA-SMI 346.46 Driver Version: 346.46 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 Off | 0000:00:03.0 Off | N/A |
| N/A 35C P0 38W / 125W | 10MiB / 4095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
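(For reference, once the driver is reachable, the usual way to watch GPU utilization continuously during training is to poll nvidia-smi, e.g. with either of these standard invocations:)
$ watch -n 1 nvidia-smi
$ nvidia-smi --loop=1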
Answer 0 (score: 19)
I fixed "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" on my ASUS laptop with a GTX 950M and Ubuntu 18.04 by disabling Secure Boot in the BIOS.
Answer 1 (score: 5)
I hit the same error on Ubuntu 16.04 (Linux 4.13 kernel) in Google Compute Engine with a K80 GPU. I upgraded the kernel and the problem was solved. Here is how I upgraded the Linux kernel:
Step 1:
Check the existing kernel of your Ubuntu Linux:
uname -a
Step 2:
Ubuntu maintains a website for all the kernel versions that have been released. At the time of this writing, the latest stable release of the Ubuntu kernel is 4.15. If you go to this link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will see several links for download.
Step 3:
Download the appropriate files based on the type of OS you have. For 64-bit, I would download the following deb files:
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
Step 4:
Install all the downloaded deb files:
sudo dpkg -i *.deb
Step 5:
Reboot your machine and check if the kernel has been updated by:
uname -a
You should see that your kernel has been upgraded, and hopefully nvidia-smi will work again.
Answer 2 (score: 4)
Run the following command to find the right NVIDIA driver:
sudo ubuntu-drivers devices
Then pick the recommended driver from the list and install it (substitute the package name that ubuntu-drivers recommends):
sudo apt install <recommended-driver-package>
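Alternatively, ubuntu-drivers can pick and install the recommended driver for you in a single step:
sudo ubuntu-drivers autoinstall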
Answer 3 (score: 4)
I was using an AWS Deep Learning AMI P2 instance when I suddenly found that the NVIDIA driver commands stopped working and the GPU was not found by the torch or tensorflow libraries. I solved it as follows:

Run `nvcc --version`. If it does not work, then run the following:

apt install nvidia-cuda-toolkit
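After the toolkit install, a quick sanity check (assuming the kernel module itself is intact) is to re-run both probes:
nvcc --version
nvidia-smi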
Hopefully this solves the problem.
Answer 4 (score: 2)
What fixed this for me, regardless of kernel version, was reinstalling the kernel headers with apt:

sudo apt-get install --reinstall linux-headers-$(uname -r)

Driver version: 390.138 on Ubuntu Server 18.04.4.
Answer 5 (score: 1)
I just want to thank @Heapify for the practical answer, and to update his answer since the attached links were no longer up to date.
Step 1:
Check the existing kernel of your Ubuntu Linux:
uname -a
Step 2:
Ubuntu maintains a website for all the kernel versions that have been released. At the time of this writing, the latest stable release of the Ubuntu kernel is 4.15. If you go to this link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will see several links for download.
Step 3:
Download the appropriate files based on the type of OS you have. For 64-bit, I would download the following deb files:
// UP-TO-DATE 2019-03-18
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
Step 4:
Install all the downloaded deb files:
sudo dpkg -i *.deb
Step 5:
Reboot your machine and check if the kernel has been updated by:
uname -a
Answer 6 (score: 1)
In my case, none of the solutions above helped:

Root cause: an incompatible gcc version.

Solution:
1. sudo apt install --reinstall gcc
2. sudo apt-get --purge -y remove 'nvidia*'
3. sudo apt install nvidia-driver-450
4. sudo reboot

System: AWS EC2 Ubuntu 18.04 instance
Source of the solution: https://forums.developer.nvidia.com/t/nvidia-smi-has-failed-in-ubuntu-18-04/68288/4
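To check for this kind of mismatch up front, you can compare the compiler the running kernel was built with (recorded in /proc/version) against the default gcc that DKMS will use to build the nvidia module:
cat /proc/version
gcc --version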
Answer 7 (score: 0)
For everyone else with the same problem for whom none of the other solutions worked: my fix was simply to disable Secure Boot and then reinstall the driver.
Answer 8 (score: 0)
I had to install the NVIDIA 367.57 driver and CUDA 7.5 to work with TensorFlow on the g2.2xlarge Ubuntu 14.04 LTS instance, e.g. nvidia-graphics-drivers-367_367.57.orig.tar.
Now the GRID K520 GPU is working while I train TensorFlow models:
ubuntu@ip-10-0-1-70:~$ nvidia-smi
Sat Apr 1 18:03:32 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57 Driver Version: 367.57 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 Off | 0000:00:03.0 Off | N/A |
| N/A 39C P8 43W / 125W | 3800MiB / 4036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2254 C python 3798MiB |
+-----------------------------------------------------------------------------+
ubuntu@ip-10-0-1-70:~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GRID K520"
CUDA Driver Version / Runtime Version 8.0 / 7.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4036 MBytes (4232052736 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 797 MHz (0.80 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 3
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GRID K520
Result = PASS
Answer 9 (score: 0)
I struggled with this issue for two days and am sharing my solution here in case anyone needs it.

The VMs I use are standard N-series GPU servers with two K80 cards on the Azure platform, running Ubuntu 18.04.

Apparently the Linux kernel had been updated a few days before I ran into this problem, and after the update the driver stopped working.

At first, I purged and reinstalled as the replies above suggested. Nothing worked. Then, suddenly (I don't remember why I did it), I updated the default gcc and g++ versions on one of my VMs as follows:
sudo apt install software-properties-common
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 90
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 90
Then I purged the nvidia software and reinstalled it following the instructions in the official documentation (please pick the right variant for your system: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=deblocal):
sudo apt-get purge nvidia-*
After that, the nvidia-smi command finally worked again.

P.S.: If you use an Azure Linux VM like me, the recommended way to install CUDA is actually to enable the "NVIDIA GPU Driver Extension" in the Azure portal (after you have configured the correct gcc version, of course).

I have tried this on my other VM and it works as well.
Answer 10 (score: 0)
Reinstalling CUDA solved the problem for me:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
echo "md5sum: $(md5sum cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb)"
echo "correct: 056de5e03444cce506202f50967b0016"
dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
apt-get -qq update
apt-get -qq -y install cuda
rm cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
Answer 11 (score: 0)
My system is Ubuntu 20.04 LTS.

I solved the problem by generating a new MOK (Machine Owner Key) and enrolling it into shim, without disabling Secure Boot (although disabling Secure Boot had also worked for me).

Just run the signing and enrollment commands described there and follow the prompts, per Ubuntu's wiki: How can I do non-automated signing of drivers
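As a sketch of the workflow that wiki page describes (the key file names and certificate subject below are illustrative, not from the original answer): generate a signing key, queue it for enrollment with mokutil, reboot to enroll it in the MOK manager, then sign the nvidia module with the kernel's sign-file helper:
openssl req -new -x509 -newkey rsa:2048 -nodes -days 36500 \
    -subj "/CN=my-driver-signing-key/" \
    -keyout MOK.priv -outform DER -out MOK.der
sudo mokutil --import MOK.der    # asks for a one-time enrollment password
sudo reboot                      # enroll the key in the blue MOK manager screen
sudo /usr/src/linux-headers-$(uname -r)/scripts/sign-file sha256 \
    MOK.priv MOK.der "$(modinfo -n nvidia)"   # sign the installed nvidia module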
Answer 12 (score: 0)
This can happen after a Linux kernel update. If you run into this error, you can rebuild the nvidia driver with dkms, which can automatically regenerate the modules after a kernel version change:

1. sudo apt-get install dkms
2. Check the installed driver version under /usr/src.
3. sudo dkms build -m nvidia -v 440.82
4. sudo dkms install -m nvidia -v 440.82

Now check whether sudo nvidia-smi works.
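A convenient way to find the exact version string to pass to dkms (instead of browsing /usr/src) is dkms status, which lists every module/version pair DKMS knows about:
dkms status | grep nvidia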
Answer 13 (score: 0)
A little-known but important fact about the NVIDIA driver is that its build is done by DKMS. This allows automatic rebuilds on kernel upgrades, which happen at system boot time. Because of that, it is easy to miss error messages, especially if you work on a cloud VM or a server with no attached IPMI/management interface. However, it is possible to trigger the DKMS build right after installing the packages by executing `dkms autoinstall`. If it fails, you will get a meaningful error message about missing dependencies or whatever else is relevant. If `dkms autoinstall` builds the modules correctly, you can simply load them with modprobe; there is no need to reboot the system (rebooting is often used merely as a way to trigger the DKMS rebuild). You can check an example here
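Putting that together, a minimal sequence after a package install or kernel upgrade would look roughly like this (loading via modprobe is my reading of the answer, not a quoted command):
sudo dkms autoinstall   # (re)build all registered modules for the running kernel
sudo modprobe nvidia    # load the freshly built module without a reboot
nvidia-smi              # should now reach the driver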
Answer 14 (score: 0)
None of the above helped me.

I am using Kubernetes on Google Cloud with Tesla K80 GPUs.

Follow this guide to make sure everything is installed correctly: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus

I had missed a few important things:
For COS nodes:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
For UBUNTU nodes:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
Make sure the update has rolled out to your nodes. If upgrades are paused, restart the nodes.

In Docker I use the image nvidia/cuda:10.1-base-ubuntu16.04.

You have to set a GPU limit! This is the only way the node driver can communicate with the pod. In your yaml config, add this under your container (a full minimal pod example follows below):
resources:
limits:
nvidia.com/gpu: 1
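For illustration, a minimal pod that exercises the GPU limit might look like this (the pod and container names are placeholders):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:10.1-base-ubuntu16.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF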
Answer 15 (score: 0)
I tried the solutions above, but only the following worked for me:
sudo apt-get update
sudo apt-get install --no-install-recommends nvidia-384 libcuda1-384 nvidia-opencl-icd-384
sudo reboot
Answer 16 (score: -3)
Try pulling the NVIDIA graphics card out and reseating it.