I installed TensorFlow 1.6.0 (GPU version) with Anaconda in a Python 3.6.4 environment.
When I run import tensorflow as tf, I get the following error:
ImportError: libcudnn.so.7: cannot open shared object file: No such file or directory
Versions:
cudnn : 7.1.1
cuda : 9.0.176
tensorflow : 1.6.0
Ubuntu : 16.04
I am aware of this, but it did not solve my problem.
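A quick way to reproduce the failure outside TensorFlow is to ask the dynamic loader for the library directly. The following is only a minimal sketch: the helper name can_load is made up for illustration, and it assumes the library name from the error message above.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the file is missing or cannot be opened as a shared object.
        return False
    return True


# libcudnn.so.7 is the name taken from the ImportError above.
print("libcudnn.so.7 loadable:", can_load("libcudnn.so.7"))
```

If this prints False, the problem is in the library installation or the loader configuration rather than in TensorFlow itself.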
Answer 0 (score: 4)
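The accepted answer (installing nvidia-cuda-toolkit) is wrong. By installing the toolkit you are basically installing a second CUDA on top of the one you already installed by following the NVIDIA guide.
The problem turned out to be a symlink issue. I was inspired by this thread, http://queirozf.com/entries/installing-cuda-tk-and-tensorflow-on-a-clean-ubuntu-16-04-install, but the actual resolution was different.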
At some point, the NVIDIA cuDNN installation tutorial asks you to do this:
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
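The problem with this approach is that copying the files with the libcudnn* glob breaks the symlinks: cp follows them, so each link arrives as a regular file. Instead, I suggest running the following command, although it can still break the links (--preserve=links only preserves hard links; GNU cp needs -P or -d to copy symlinks as symlinks):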
sudo cp --preserve=links cuda/lib64/libcudnn* /usr/local/cuda/lib64
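For illustration, a copy that keeps symlinks as symlinks can be sketched in Python. This is not part of the NVIDIA instructions; the function name copy_keeping_symlinks is hypothetical, and the directories are whatever you pass in (writing to /usr/local/cuda/lib64 would need root).

```python
import glob
import os
import shutil


def copy_keeping_symlinks(src_dir, dst_dir, pattern="libcudnn*"):
    """Copy files matching pattern, recreating symlinks as symlinks."""
    copied = []
    for src in sorted(glob.glob(os.path.join(src_dir, pattern))):
        dst = os.path.join(dst_dir, os.path.basename(src))
        if os.path.lexists(dst):
            # Replace a stale file or link left over from an earlier copy.
            os.remove(dst)
        # follow_symlinks=False recreates a link instead of copying its target.
        shutil.copy2(src, dst, follow_symlinks=False)
        copied.append(dst)
    return copied
```

A call such as copy_keeping_symlinks("cuda/lib64", "/usr/local/cuda/lib64") would mirror the cp command above, under the assumption that the extracted archive sits in the current directory.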
You can verify the links by running ls -lha libcudnn* in the /usr/local/cuda/lib64 folder. If you do not see something like this:
lrwxrwxrwx 1 root root   13 May  2 20:02 libcudnn.so -> libcudnn.so.7
lrwxrwxrwx 1 root root   17 May  2 20:02 libcudnn.so.7 -> libcudnn.so.7.6.5
-rwxr-xr-x 1 root root 409M May  2 20:02 libcudnn.so.7.6.5
-rw-r--r-- 1 root root 386M May  2 20:02 libcudnn_static.a
then you have just found the problem. The actual fix is to do the following:
sudo rm /usr/local/cuda/lib64/libcudnn.so
sudo rm /usr/local/cuda/lib64/libcudnn.so.7
cd /usr/local/cuda/lib64/
sudo ln -s libcudnn.so.7.6.5 libcudnn.so.7
sudo ln -s libcudnn.so.7 libcudnn.so
This removes the old "links" and creates new ones. Verify the links again with ls -lha libcudnn*. After that, run the following command in verbose mode:
sudo ldconfig -v
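Check the log. I don't know exactly what this command does, but it turned out to be important. Also, if the log says that a symlink is broken, or something along those lines, tensorflow will keep showing the error mentioned in this thread.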
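The symlink check described above can also be scripted. The sketch below assumes the two link names used in this answer (libcudnn.so and libcudnn.so.7); check_cudnn_links is a made-up helper name.

```python
import os


def check_cudnn_links(lib_dir, link_names=("libcudnn.so", "libcudnn.so.7")):
    """Map each name to its symlink target, or to None if it is not a symlink."""
    result = {}
    for name in link_names:
        path = os.path.join(lib_dir, name)
        # os.path.islink is False for regular files and for missing paths.
        result[name] = os.readlink(path) if os.path.islink(path) else None
    return result


for name, target in check_cudnn_links("/usr/local/cuda/lib64").items():
    print(name, "->", target if target is not None else "NOT A SYMLINK")
```

Any entry reported as not a symlink is a candidate for the rm and ln -s steps shown earlier.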
Bonus: make sure the following lines are appended to your ~/.bashrc (open it with nano ~/.bashrc):
export PATH=/usr/local/cuda/bin:/opt/nvidia/nsight-compute/2019.4.0${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDADIR=/usr/local/cuda${CUDADIR:+:${CUDADIR}}
export CUDA_HOME=/usr/local/cuda
Then run source ~/.bashrc.
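To confirm that the new shell actually picked up these paths, a small check along the following lines may help. It is only a sketch: missing_dirs and REQUIRED are illustrative names, and the two directories are the ones from the LD_LIBRARY_PATH line above.

```python
import os


def missing_dirs(path_value, required_dirs):
    """Return the required directories that are absent from a PATH-style string."""
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return [d for d in required_dirs if d not in entries]


# Directories taken from the LD_LIBRARY_PATH export above.
REQUIRED = ["/usr/local/cuda/lib64", "/usr/local/cuda/extras/CUPTI/lib64"]

print("missing from LD_LIBRARY_PATH:",
      missing_dirs(os.environ.get("LD_LIBRARY_PATH", ""), REQUIRED))
```

An empty list means both directories are visible to processes started from this shell.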
All of the steps above assume that you did not use nvidia-cuda-toolkit and instead installed from the NVIDIA CUDA repo.
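Also, when installing CUDA, make sure you do not target 10.2. At the time of writing, TF supports versions up to CUDA 10.1, so the following is the correct way to install the required version: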
sudo apt-cache policy cuda
sudo apt-get install cuda=10.1.243-1
Verify with:
nvcc --version
nvidia-smi
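The version check can be automated by parsing the nvcc output. The sketch below assumes the usual "release X.Y" wording in the last line of nvcc --version; the helper names and the sample line are illustrative, and the (10, 1) limit is simply the one stated in this answer.

```python
import re


def parse_nvcc_release(text):
    """Extract (major, minor) from nvcc --version output, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_supported(release, max_supported=(10, 1)):
    """True if the release is known and not newer than max_supported."""
    return release is not None and release <= max_supported


# Assumed sample of the line nvcc prints for CUDA 10.1.
sample = "Cuda compilation tools, release 10.1, V10.1.243"
print(parse_nvcc_release(sample), is_supported(parse_nvcc_release(sample)))
```

In practice the text would come from running nvcc --version, for example through subprocess.run with capture_output=True.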
Edit: I found the error that you should not be seeing after running the ldconfig command:
/usr/local/cuda-10.1/targets/x86_64-linux/lib:
...
libnppist.so.10 -> libnppist.so.10.2.0.243
libcuinj64.so.10.1 -> libcuinj64.so.10.1.243
> /sbin/ldconfig.real: /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7 is not a symbolic link
libcudnn.so.7 -> libcudnn.so.7.6.5
libnppc.so.10 -> libnppc.so.10.2.0.243
libnppicom.so.10 -> libnppicom.so.10.2.0.243
libnvgraph.so.10 -> libnvgraph.so.10.1.243
/usr/lib/x86_64-linux-gnu/libfakeroot:
...
If you see it, something is still misconfigured.
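Since the ldconfig log is long, it can be easier to filter it for this specific warning. The following sketch assumes the message wording shown above ("is not a symbolic link"); find_non_symlink_warnings is a hypothetical helper that works on text you have already captured.

```python
def find_non_symlink_warnings(ldconfig_output):
    """Return the library paths that ldconfig reports as not being symlinks."""
    marker = "is not a symbolic link"
    paths = []
    for line in ldconfig_output.splitlines():
        if marker not in line:
            continue
        # The text before the marker ends with the offending path.
        parts = line.split(marker)[0].split()
        if parts:
            paths.append(parts[-1])
    return paths


sample = (
    "libcuinj64.so.10.1 -> libcuinj64.so.10.1.243\n"
    "/sbin/ldconfig.real: /usr/local/cuda-10.1/targets/x86_64-linux/lib/"
    "libcudnn.so.7 is not a symbolic link\n"
    "libcudnn.so.7 -> libcudnn.so.7.6.5\n"
)
print(find_non_symlink_warnings(sample))
```

When capturing the real output, it is probably safest to collect both stdout and stderr, because the warning may not be printed on the same stream as the library listing.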
Answer 1 (score: 1)
I installed the nvidia-cuda-toolkit package:
$ sudo apt install nvidia-cuda-toolkit
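and it worked.
I did not find this solution on the TensorFlow website or on the NVIDIA installation page. I came across it by luck while looking for a way to get the CUDA version from the command line: How to get the cuda version?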
Answer 2 (score: 1)
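I don't have enough reputation to comment on Alex's answer, but on Ubuntu 20.04 the paths have changed! Also, --preserve=links is no longer needed when running cp! So I am posting a new answer: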
In an environment created with conda create --name tfgpu10.1 python=3.8, install the cuDNN 7.6 library for TensorFlow 2.3.1 with CUDA 10.1:
tar -xvzf cudnn-10.1-linux-x64-v7.6.5.32.tgz
sudo cp cuda/include/cudnn.h /usr/lib/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/lib/cuda/lib64/
sudo chmod a+r /usr/lib/cuda/include/cudnn.h /usr/lib/cuda/lib64/libcudnn*
Test result:
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2020-12-02 03:58:41.089993: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>>> tf.config.list_physical_devices("GPU")
2020-12-02 03:58:48.538295: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-12-02 03:58:48.587523: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-02 03:58:48.587838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 Ti computeCapability: 7.5
coreClock: 1.485GHz coreCount: 16 deviceMemorySize: 3.82GiB deviceMemoryBandwidth: 178.84GiB/s
2020-12-02 03:58:48.587860: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-02 03:58:48.589111: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-12-02 03:58:48.590284: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-12-02 03:58:48.590488: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-12-02 03:58:48.591785: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-12-02 03:58:48.592520: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-12-02 03:58:48.595129: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-12-02 03:58:48.595213: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-02 03:58:48.595555: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-02 03:58:48.595815: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Answer 3 (score: -4)
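This did not work for me. In my case, the cause was that I had multiple versions of CUDA installed and my cuDNN was an older version, not the one I was trying to use. I installed a newer version of cuDNN following NVIDIA's instructions, and that fixed it for me.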
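To see which cuDNN version a given CUDA installation actually contains, the version macros in the header can be read. This is a sketch under the assumption that the header defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL, as the cuDNN 7 cudnn.h does (newer releases keep them in cudnn_version.h); parse_cudnn_version and the sample text are illustrative.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from cuDNN header text, or None if incomplete."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"#define\s+" + key + r"\s+(\d+)", header_text)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


# Assumed excerpt of a cuDNN 7.6.5 header, for illustration only.
sample_header = """
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5
"""
print(parse_cudnn_version(sample_header))
```

Running it on the contents of each installed header (for example /usr/local/cuda/include/cudnn.h) shows whether an older cuDNN is sitting next to the CUDA version you intend to use.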