我正在尝试运行一些Tensorflow代码,但出现了似乎是一个常见问题:
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 python -c "import tensorflow; tensorflow.Session()"
2019-02-06 20:36:15.903204: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-06 20:36:15.908809: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-02-06 20:36:15.908858: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: tigris
2019-02-06 20:36:15.908868: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: tigris
2019-02-06 20:36:15.908942: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 390.77.0
2019-02-06 20:36:15.908985: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 390.30.0
2019-02-06 20:36:15.909006: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:308] kernel version 390.30.0 does not match DSO version 390.77.0 -- cannot find working devices in this configuration
$
该错误消息的关键部分似乎是:
[...] libcuda reported version is: 390.77.0
[...] kernel reported version is: 390.30.0
[...] kernel version 390.30.0 does not match DSO version 390.77.0 -- cannot find working devices in this configuration
如何安装兼容版本? libcuda版本从哪里来?
几个月前,我尝试安装具有GPU支持的Tensorflow,但是这些版本可能无法显示或无法与Tensorflow一起使用。最后,我遵循tutorial的有关如何在同一台计算机上安装CUDA库的多个版本的方法来工作。那在当时是可行的,但是几个月后我回到该项目时,它已经停止工作。我认为那段时间已经升级了一些驱动程序。
我尝试做的第一件事是查看我具有nvidia驱动程序和libcuda软件包的版本。
$ dpkg --list|grep libcuda
ii libcuda1-390 390.30-0ubuntu1 amd64 NVIDIA CUDA runtime library
看起来是390.30。为什么错误消息说libcuda报告390.77?
$ dpkg --list|grep nvidia
ii libnvidia-container-tools 1.0.1-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.0.1-1 amd64 NVIDIA container runtime library
rc nvidia-384 384.130-0ubuntu0.16.04.1 amd64 NVIDIA binary driver - version 384.130
ii nvidia-390 390.30-0ubuntu1 amd64 NVIDIA binary driver - version 390.30
ii nvidia-390-dev 390.30-0ubuntu1 amd64 NVIDIA binary Xorg driver development files
rc nvidia-396 396.44-0ubuntu1 amd64 NVIDIA binary driver - version 396.44
ii nvidia-container-runtime 2.0.0+docker18.09.1-1 amd64 NVIDIA container runtime
ii nvidia-container-runtime-hook 1.4.0-1 amd64 NVIDIA container runtime hook
ii nvidia-docker2 2.0.3+docker18.09.1-1 all nvidia-docker CLI wrapper
ii nvidia-modprobe 390.30-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
rc nvidia-opencl-icd-384 384.130-0ubuntu0.16.04.1 amd64 NVIDIA OpenCL ICD
ii nvidia-opencl-icd-390 390.30-0ubuntu1 amd64 NVIDIA OpenCL ICD
rc nvidia-opencl-icd-396 396.44-0ubuntu1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.8.8.2 all Tools to enable NVIDIA's Prime
ii nvidia-settings 396.44-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
再次,一切看起来像是390.30。有些软件包的版本为390.77,但它们的状态为rc
。我猜我安装了该版本,后来又删除了它,因此配置文件被遗忘了。我使用以下命令清除了配置文件:
sudo apt-get remove --purge nvidia-kernel-common-390
现在,版本390.77完全没有软件包。
$ dpkg --list|grep 390.77
$
我尝试重新安装CUDA,以查看其是否使用错误的版本进行编译。
$ sudo sh cuda_9.0.176_384.81_linux.run --silent --toolkit --toolkitpath=/usr/local/cuda-9.0 --override
那没什么区别。
最后,我尝试运行nvidia-smi。
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
$
所有这些都在带有Python 3.6.7的Ubuntu 18.04上运行,我的图形卡是NVIDIA Corporation GM107M [GeForce GTX 960M](rev a2)。
答案 0 :(得分:-2)
我终于想到了要搜索名称为390.77的任何文件。
$ locate 390.77
/usr/lib/i386-linux-gnu/libcuda.so.390.77
/usr/lib/i386-linux-gnu/libnvcuvid.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-compiler.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-encode.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-fatbinaryloader.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-ml.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-opencl.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.390.77
/usr/lib/i386-linux-gnu/vdpau/libvdpau_nvidia.so.390.77
/usr/lib/x86_64-linux-gnu/libcuda.so.390.77
/usr/lib/x86_64-linux-gnu/libnvcuvid.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.390.77
/usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.390.77
就这样!仔细观察表明,我一定已经安装了新版本。
$ ls /usr/lib/i386-linux-gnu/libcuda* -l
lrwxrwxrwx 1 root root 12 Nov 8 13:58 /usr/lib/i386-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 17 Nov 12 14:04 /usr/lib/i386-linux-gnu/libcuda.so.1 -> libcuda.so.390.77
-rw-r--r-- 1 root root 9179124 Jan 31 2018 /usr/lib/i386-linux-gnu/libcuda.so.390.30
-rw-r--r-- 1 root root 9179796 Jul 10 2018 /usr/lib/i386-linux-gnu/libcuda.so.390.77
它们来自哪里?
$ dpkg -S /usr/lib/i386-linux-gnu/libcuda.so.390.30
libcuda1-390: /usr/lib/i386-linux-gnu/libcuda.so.390.30
$ dpkg -S /usr/lib/i386-linux-gnu/libcuda.so.390.77
dpkg-query: no path found matching pattern /usr/lib/i386-linux-gnu/libcuda.so.390.77
因此390.77不再属于任何程序包。也许我安装了旧版本并不得不强迫它覆盖链接。
我的计划是删除文件,然后重新安装软件包以设置指向正确版本的链接。那我需要重新安装哪些软件包?
$ locate 390.77|sed -e 's/390.77/390.30/'|xargs dpkg -S
某些文件不匹配任何文件,但匹配的文件来自以下软件包:
我不由自主地删除了390.77版文件。
locate 390.77|sudo xargs rm
然后我重新安装软件包。
sudo apt-get install --reinstall libcuda1-390 nvidia-opencl-icd-390
终于可以了!
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 python -c "import tensorflow; tensorflow.Session()"
2019-02-06 22:13:59.460822: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-06 22:13:59.665756: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-06 22:13:59.666205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.176
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.81GiB
2019-02-06 22:13:59.666226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-06 22:17:21.254445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-06 22:17:21.254489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-06 22:17:21.254496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-06 22:17:21.290992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3539 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0)
nvidia-smi
现在也可以使用。
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 nvidia-smi
Wed Feb 6 22:19:24 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960M Off | 00000000:01:00.0 Off | N/A |
| N/A 45C P8 N/A / N/A | 113MiB / 4046MiB | 6% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3212 G /usr/lib/xorg/Xorg 113MiB |
+-----------------------------------------------------------------------------+
我重新启动,视频驱动程序继续工作。哇!