Question

I upgraded to TensorFlow 1.0 and installed CUDA 8.0, cuDNN 5.1, and the latest NVIDIA driver, 375.39. My NVIDIA hardware is a p2.xlarge instance (Tesla K80) on Amazon Web Services. My operating system is 64-bit Linux.

Every time I call tf.Session() I get the following error message:

[ec2-user@ip-172-31-7-96 CUDA]$ python
Python 2.7.12 (default, Sep  1 2016, 22:14:00)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
>>> sess = tf.Session()
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: ip-172-31-7-96
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: ip-172-31-7-96
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Invalid argument: expected %d.%d or %d.%d.%d form for driver version; got "1"
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:363] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  375.39  Tue Jan 31 20:47:00 PST 2017
GCC version:  gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 375.39.0

I have no idea how to fix this. I have tried different versions of the NVIDIA driver and CUDA, but it still does not work.

Any hints would be appreciated.

Answer 1

You need to install the NVIDIA driver and run the CUDA 8.0 installer.

# Requirements
# - NVIDIA Driver - NVIDIA-Linux-x86_64-375.39.run - http://www.nvidia.fr/Download/index.aspx
# - CUDA runfile (local) - cuda_8.0.61_375.26_linux.run - https://developer.nvidia.com/cuda-downloads
# - cudnn-8.0-linux-x64-v5.0-ga.tgz

# These two commands assume a Debian/Ubuntu image; on a yum-based image, use the yum equivalents.
sudo apt update -y && sudo apt upgrade -y
sudo apt install build-essential linux-image-extra-$(uname -r) -y

chmod +x NVIDIA-Linux-x86_64-375.39.run
sudo ./NVIDIA-Linux-x86_64-375.39.run

chmod +x cuda_8.0.61_375.26_linux.run
./cuda_8.0.61_375.26_linux.run --extract=`pwd`/extracts
sudo ./extracts/cuda-linux64-rel-8.0.61-21551265.run

echo -e "export CUDA_HOME=/usr/local/cuda\nexport PATH=\$PATH:\$CUDA_HOME/bin\nexport LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:\$CUDA_HOME/lib64" >> ~/.bashrc
source ~/.bashrc

tar xf cudnn-8.0-linux-x64-v5.0-ga.tgz
cd cuda
sudo cp lib64/* /usr/local/cuda/lib64/
sudo cp include/cudnn.h /usr/local/cuda/include/
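
Once the script has run, a quick sanity check of the result can save a round of debugging. The following is only a minimal sketch, not part of the original answer: the helper name check_cuda_env is hypothetical, and it assumes the toolkit lives in /usr/local/cuda as in the commands above. It only checks the environment variables and looks for the libcublas and libcudnn files that the TensorFlow log reports loading; it does not prove the driver can see the GPU.

```python
import os


def check_cuda_env(env, cuda_home="/usr/local/cuda"):
    """Return a list of problems found; an empty list means the basics look fine.

    `env` is a dict of environment variables (for example dict(os.environ)).
    `cuda_home` is an assumption taken from the install commands above.
    """
    problems = []
    lib_dir = os.path.join(cuda_home, "lib64")
    bin_dir = os.path.join(cuda_home, "bin")

    if env.get("CUDA_HOME") != cuda_home:
        problems.append("CUDA_HOME is not set to %s" % cuda_home)

    ld_paths = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    if lib_dir not in ld_paths:
        problems.append("LD_LIBRARY_PATH does not contain %s" % lib_dir)

    path_entries = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if bin_dir not in path_entries:
        problems.append("PATH does not contain %s" % bin_dir)

    if not os.path.isdir(lib_dir):
        problems.append("%s does not exist" % lib_dir)
    else:
        names = os.listdir(lib_dir)
        # Library name prefixes taken from the dso_loader lines in the question's log.
        for prefix in ("libcublas.so", "libcudnn.so"):
            if not any(name.startswith(prefix) for name in names):
                problems.append("no %s* found in %s" % (prefix, lib_dir))

    return problems


if __name__ == "__main__":
    for problem in check_cuda_env(dict(os.environ)):
        print("WARNING: " + problem)
```

An empty result here together with a persisting CUDA_ERROR_NO_DEVICE would suggest the problem is on the driver side rather than in the toolkit paths.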

Answer 2

Uninstall the driver and CUDA, then reinstall them following the official guide.

Run deviceQuery to check that the device is installed correctly.
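
If you want to script that check, something like the following might work. It is a rough sketch with assumptions: deviceQuery is one of the CUDA samples and has to be built first, so the path to the binary is yours to supply (the "./deviceQuery" default is a placeholder), and the function names are hypothetical. The sample prints "Result = PASS" on success and "Result = FAIL" otherwise, which is all the sketch looks for.

```python
import subprocess


def device_query_passed(output):
    """Return True if deviceQuery's text output reports a passing result."""
    return "Result = PASS" in output


def run_device_query(binary="./deviceQuery"):
    """Run the deviceQuery sample and report whether it passed.

    `binary` is an assumed path; point it at wherever you built the sample.
    Returns False if the binary is missing or cannot be executed.
    """
    try:
        proc = subprocess.Popen(
            [binary], stdout=subprocess.PIPE, stderr=subprocess.STDOUT
        )
    except OSError:
        return False
    out, _ = proc.communicate()
    text = out.decode("utf-8", "replace")
    return proc.returncode == 0 and device_query_passed(text)


if __name__ == "__main__":
    print("GPU visible to CUDA: %s" % run_device_query())
```

A failing result here, before TensorFlow is even involved, points at the driver or the instance rather than at the TensorFlow install.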

Answer 3

You could also try the "NVIDIA Volta Deep Learning AMI" on a p3 (V100 GPU) instance.

Sign up at https://www.nvidia.com/en-us/gpu-cloud/?ncid=van-gpu-cloud and get your "API key" to use the AMI for free.

EC2 / GPU configuration details: https://aws.amazon.com/blogs/aws/new-amazon-ec2-instances-with-up-to-8-nvidia-tesla-v100-gpus-p3/

Answer 4

The AWS Deep Learning AMI comes with CUDA 8, 9, and 10 preinstalled, so you no longer need to do this installation yourself.

Reference: https://docs.aws.amazon.com/dlami/latest/devguide/overview-cuda.html
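
To see which of those CUDA versions an instance actually has, and which one is active, a small script can help. This is a sketch under assumptions: it supposes the toolkits sit side by side as /usr/local/cuda-X.Y directories with /usr/local/cuda as a symlink to the active one, which is the usual layout but worth confirming on your AMI. The function name list_cuda_toolkits is hypothetical.

```python
import os
import re

_CUDA_DIR = re.compile(r"^cuda-(\d+\.\d+)$")


def list_cuda_toolkits(root="/usr/local"):
    """Return (installed_versions, active_version) found under `root`.

    installed_versions: version strings from cuda-X.Y directories, sorted numerically.
    active_version: the version the `cuda` symlink points to, or None if undetermined.
    The directory layout is an assumption, not something the AMI guarantees.
    """
    versions = []
    if os.path.isdir(root):
        for name in os.listdir(root):
            match = _CUDA_DIR.match(name)
            if match and os.path.isdir(os.path.join(root, name)):
                versions.append(match.group(1))
    versions.sort(key=lambda v: tuple(int(part) for part in v.split(".")))

    active = None
    link = os.path.join(root, "cuda")
    if os.path.islink(link):
        target = os.path.basename(os.path.realpath(link))
        match = _CUDA_DIR.match(target)
        if match:
            active = match.group(1)
    return versions, active


if __name__ == "__main__":
    installed, active = list_cuda_toolkits()
    print("installed: %s, active: %s" % (installed, active))
```

TensorFlow 1.0 was built against CUDA 8.0, so the active version should be 8.0 for the setup in the question.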

How to install CUDA 8.0 with the latest TensorFlow (1.0) on an AWS p2.xlarge instance, with AMI ami-edb11e8d and the latest NVIDIA driver (375.39)
