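I'm about ready to pull my hair out. I've spent a lot of time trying different things to get my card working with TensorFlow.
My latest attempt (which hit a problem similar to the earlier ones) was to try the TensorFlow Docker image: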
https://hub.docker.com/r/tensorflow/tensorflow/
I installed nvidia-docker and ran nvidia-smi, which seems to report that my GPU is present.
I then ran this command:
nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:latest-gpu
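Once it had downloaded and started, I tried running the notebooks (the first TensorFlow notebook, to begin with).
As soon as I try to import tensorflow (using the default, unmodified notebook), I get a KernelRestart: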
KernelRestarter: restarting kernel (1/5), keep random ports
I'm not sure what the next best step is. I don't know how to troubleshoot a Docker container, let alone a Jupyter notebook running inside one.
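One way to see more than the notebook's restart message is to run the import in a child process and look at how that process exits. The sketch below is only an illustration: `try_import` is a hypothetical helper name, and it assumes Python 3 on a POSIX system, where a negative return code means the child was killed by that signal number (for example -11 for SIGSEGV or -4 for SIGILL).

```python
import subprocess
import sys


def try_import(module_name, timeout=120):
    """Import a module in a child process and report how the child exited.

    Returns (returncode, last_stderr_lines). A return code of 0 means the
    import worked, 1 usually means a Python exception, and a negative value
    (POSIX only) means the interpreter itself was killed by a signal.
    """
    result = subprocess.run(
        [sys.executable, "-c", "import " + module_name],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    # Keep only the last few lines of stderr; that is where the error ends up.
    tail = result.stderr.strip().splitlines()[-5:]
    return result.returncode, tail
```

Calling `try_import("tensorflow")` from a plain Python shell inside the container (rather than from the notebook) should show whether the import raises a Python error or crashes the interpreter outright.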
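I've had similar problems before when running locally, without a Docker container.
Any good suggestions on what to try next? I spent more on this card than I care to admit, and I'm out of ideas for getting it to work.
(I believe I can import locally on my machine with tensorflow-gpu installed, but when I get to the conv2d part I get "could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED", if I recall correctly; it has been a busy few days.)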
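EDIT: Yes to CUDA and cuDNN. I installed nvidia-390 easily enough, and it seems fine, as far as nvidia-smi working is a good test. I just finished compiling TF from scratch and it still fails (in this case importing tf doesn't fail, but I get the same initialization error; perhaps I don't have the right nvidia version for what it mentions, and I believe I'm on nvidia-390.77). I'm considering a fresh 18.04 install with an earlier nvidia-3xx driver. Trying to "downgrade" left apt broken and took several days to fix.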
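EDIT2: I also realized that I had CUDA 9.0 installed, but the cuDNN 7.1 I installed was the build for CUDA 9.1 (NVIDIA lets you download that one, whatever that means). I'm trying to revert, but I'm having a lot of trouble backing it out, and I'm close to wiping and reinstalling Ubuntu and going from there. I have all the commands written down and think that might be easier, but I'm not sure it will solve the problem. (For example, cudnn-9.0-linux-x64-v7.1.)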
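A mismatch like this can be caught before unpacking anything, because the cuDNN archive name carries the CUDA version it was built for. This is a minimal sketch that assumes the naming pattern seen in `cudnn-9.0-linux-x64-v7.1`; the function names are hypothetical, and the `cudnn-9.1-...` name in the usage note is made up to follow the same pattern.

```python
import re


def cudnn_archive_versions(archive_name):
    """Return (cuda_version, cudnn_version) from a cuDNN archive name.

    Assumes names shaped like 'cudnn-9.0-linux-x64-v7.1' (a trailing
    extension such as '.tgz' is ignored).
    """
    match = re.match(r"cudnn-(\d+\.\d+)-linux-x64-v(\d+(?:\.\d+)*)", archive_name)
    if match is None:
        raise ValueError("unrecognized cuDNN archive name: %r" % archive_name)
    return match.group(1), match.group(2)


def matches_installed_cuda(archive_name, installed_cuda):
    """True if the archive was built for the installed CUDA version string."""
    cuda_version, _ = cudnn_archive_versions(archive_name)
    return cuda_version == installed_cuda
```

With CUDA 9.0 installed, `matches_installed_cuda("cudnn-9.0-linux-x64-v7.1", "9.0")` is true, while a name such as `cudnn-9.1-linux-x64-v7.1` would be rejected.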
EDIT3: Coming back to respond to this. I wrote up the basics of getting my GPU working on my host under Ubuntu 16.04, but I haven't tested it in Docker. Here is the gist:
https://gist.github.com/onaclov2000/c22fe1456ffa7da6cebd67600003dffb
Copied and pasted here:
# 1070 Ti
Fresh Install 16.04
(download updates, and include 3rd party)
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install nvidia-384
# Contents
sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf > /dev/null << 'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u
sudo reboot
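After the reboot it may be worth confirming that nouveau really stayed unloaded before installing CUDA. The following is a small sketch, not part of the original steps: it reads `/proc/modules`, where the first field of each line is a module name, and the function names are hypothetical.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # The first whitespace-separated field is the module name.
            names.add(fields[0])
    return names


def nouveau_is_loaded(path="/proc/modules"):
    """True if the nouveau kernel module is currently loaded."""
    with open(path) as handle:
        return "nouveau" in loaded_modules(handle.read())
```

If `nouveau_is_loaded()` still returns true after rebooting, the blacklist file probably was not picked up by the initramfs.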
# Takes about 30-40 minutes 1.5GB approx
wget https://developer.download.nvidia.com/compute/cuda/9.0/secure/Prod/local_installers/cuda_9.0.176_384.81_linux.run
sudo sh cuda_9.0.176_384.81_linux.run
No to install nvidia accelerated Graphics Driver for Linux
yes to Cuda 9.0 toolkit
default
yes to symbolic link
yes to samples
default location is fine
#Alternately (need to test)
#sudo sh cuda_9.0.176_384.81_linux.run --silent --toolkit --samples
cat >> ~/.bashrc << 'EOF'
export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64\
${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
EOF
cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery
make
./deviceQuery # Assuming make was successful
cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/bandwidthTest
make
./bandwidthTest # Assuming make was successful
# Look for Result = PASS
sudo apt-get install nvidia-cuda-toolkit
# Couldn't find on 16.04 maybe this is a 18.04 upgrade?
#sudo apt-get install cuda-toolkit-9.0 cuda-command-line-tools-9-0
# At this point the driver and CUDA are installed; now it's time to install cuDNN.
# This is the link that I have. Be sure to use v7, not v7.1, as I haven't had luck with v7.1 in the past (though it might work).
https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v7.0.5/prod/9.0_20171129/cudnn-9.0-linux-x64-v7
# 333 MB so will take a bit
cd ~/Downloads
tar -xvf cudnn-9.0-linux-x64-v7.tgz
cd cuda
sudo cp lib64/* /usr/local/cuda/lib64/
sudo cp include/* /usr/local/cuda/include/
sudo apt-get install git tmux
cd ~/Downloads
# At this point I'm going to install Anaconda
wget https://repo.continuum.io/archive/Anaconda3-4.3.1-Linux-x86_64.sh -O anaconda-install.sh
bash anaconda-install.sh # Follow Prompts adding path to bash
source ~/.bashrc
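# Include python in the environment so that pip installs into it rather than into the base install
conda create --name ml python=3.6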
source activate ml
pip install tensorflow-gpu==1.5
# test the install
cd ~
mkdir projects
cd projects
git clone https://github.com/tensorflow/models
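Since the failure I kept hitting showed up at conv2d rather than at import, a quick test of the install is to run a single small convolution. This is only a sketch written against the TensorFlow 1.x API installed above (`tf.Session`, `tf.random_normal`, `tf.nn.conv2d`); the function name is hypothetical, and it returns `None` when TensorFlow 1.x is not available instead of raising.

```python
def conv2d_smoke_test():
    """Run one tiny conv2d and return the output shape.

    Returns None if TensorFlow cannot be imported or does not expose the
    1.x session API this sketch assumes.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    if not hasattr(tf, "Session"):
        # Not the TensorFlow 1.x API this sketch is written for.
        return None
    graph = tf.Graph()
    with graph.as_default():
        images = tf.random_normal([1, 8, 8, 1])   # batch, height, width, channels
        kernel = tf.random_normal([3, 3, 1, 1])   # height, width, in, out
        output = tf.nn.conv2d(images, kernel, strides=[1, 1, 1, 1], padding="SAME")
        with tf.Session() as session:
            return tuple(session.run(output).shape)
```

On a working GPU install this should come back with `(1, 8, 8, 1)`. On a CPU-only machine the convolution runs on the CPU, so it says nothing about cuDNN there.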
# Additional notes
Run a sample from the cuda samples folder
cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery
make
./deviceQuery
Output:
Plenty but ends with the following
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 2
Result = PASS
This tells you which cudnn is installed
cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
Outputs:
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 4
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
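The same check can be scripted. This sketch parses the three defines from the header text and combines them using the `CUDNN_VERSION` formula shown in the output above; the function name is hypothetical.

```python
import re


def cudnn_version_from_header(header_text):
    """Return (major, minor, patch, combined) parsed from cudnn.h-style text."""
    parts = {}
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(
            r"^#define\s+CUDNN_%s\s+(\d+)" % name, header_text, re.MULTILINE
        )
        if match is None:
            raise ValueError("CUDNN_%s not found in header text" % name)
        parts[name] = int(match.group(1))
    # Same formula as the CUDNN_VERSION macro in the header.
    combined = parts["MAJOR"] * 1000 + parts["MINOR"] * 100 + parts["PATCHLEVEL"]
    return parts["MAJOR"], parts["MINOR"], parts["PATCHLEVEL"], combined
```

Passing the contents of `/usr/local/cuda/include/cudnn.h` gives the installed version as numbers that are easy to compare in a script.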
# This tells you which CUDA toolkit version is installed
nvcc --version
Outputs:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
I eventually upgraded to 18.04 but haven't chased all of this down again, so as I go along I'll update the gist above with an 18.04 version.