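Deep learning on an Nvidia 1070 Ti with Ubuntu 18.04

Time: 2018-08-18 08:36:06

Tags: python docker tensorflow nvidia-docker

I'm about ready to pull my hair out. I've spent a lot of time trying different things to get my card working with TensorFlow.

My latest attempt (which hit problems similar to the earlier ones) was to install the TensorFlow Docker image: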

https://hub.docker.com/r/tensorflow/tensorflow/

I installed nvidia-docker and ran SMI, which seems to report that my GPU is present.
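
A minimal sketch of checking the `nvidia-smi` report from a script rather than by eye. It parses the CSV rows printed by `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader`. The helper names and the sample line are illustrative assumptions, not output captured from the machine in question, and the 384.81 minimum is simply the driver version embedded in the CUDA 9.0 runfile name (`cuda_9.0.176_384.81_linux.run`).

```python
def parse_smi_csv(text):
    """Parse `name, driver_version` CSV rows into (name, version_tuple) pairs."""
    gpus = []
    for line in text.strip().splitlines():
        # Split on the last comma so a GPU name containing commas stays intact.
        name, driver = [field.strip() for field in line.rsplit(",", 1)]
        version = tuple(int(part) for part in driver.split("."))
        gpus.append((name, version))
    return gpus


def driver_at_least(version, minimum):
    """Compare driver versions as integer tuples, e.g. (390, 77) >= (384, 81)."""
    return version >= minimum


# Hypothetical sample row; replace it with real nvidia-smi output.
sample = "GeForce GTX 1070 Ti, 390.77\n"
for name, version in parse_smi_csv(sample):
    # Assumption: 384.81 is the driver bundled with the CUDA 9.0 runfile.
    print(name, version, driver_at_least(version, (384, 81)))
```

To feed it live data, pass in the text returned by `subprocess.check_output(["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"], universal_newlines=True)`.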

Then I ran this command:

nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:latest-gpu

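After it downloaded and started, I tried running a notebook (the first TensorFlow notebook).

As soon as I try to import tensorflow (using the default, unmodified notebook), I get a KernelRestart: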

KernelRestarter: restarting kernel (1/5), keep random ports

I'm not sure what the next best step is. I don't know how to troubleshoot the Docker container itself, let alone troubleshoot inside the Jupyter notebook.
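
One way to take Jupyter out of the picture is to run the import in a throwaway container and look at the exit status, since the notebook only reports that the kernel restarted. The sketch below is an assumption-laden starting point: it assumes the image has a `python` executable on its path, the helper names are made up for illustration, and the "128 + signal number" reading of exit codes is a shell/Docker convention rather than a guarantee.

```python
import subprocess


def build_import_check(image="tensorflow/tensorflow:latest-gpu",
                       launcher="nvidia-docker"):
    """Return the argv that imports TensorFlow in a temporary container."""
    probe = "import tensorflow as tf; print(tf.__version__)"
    return [launcher, "run", "--rm", image, "python", "-c", probe]


def describe_exit(code):
    """Turn a container exit code into a rough hint about what happened."""
    if code == 0:
        return "import succeeded"
    # A process killed by signal N is usually reported as exit code 128 + N.
    signals = {
        4: "SIGILL (illegal instruction)",
        6: "SIGABRT (abort)",
        9: "SIGKILL (killed, often out of memory)",
        11: "SIGSEGV (segmentation fault)",
    }
    if code > 128 and code - 128 in signals:
        return "killed by " + signals[code - 128]
    return "failed with exit code %d; read the traceback above" % code


def run_import_check():
    """Run the probe. Needs Docker and nvidia-docker, so it is not called here."""
    completed = subprocess.run(build_import_check())
    return describe_exit(completed.returncode)
```

Calling `run_import_check()` on the host prints whatever the container prints, so a Python traceback or a crash message shows up directly in the terminal instead of being hidden behind the kernel restarter.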

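I ran into similar problems earlier when I tried running locally, without a Docker container.

Any good suggestions on what to do next? I've spent more money on this card than I care to admit, and I have no idea how to get it working.

(I believe I can import locally on my machine with tensorflow-gpu installed, but when I get to the conv2d part I get a failure to create the cudnn handle: CUDNN_STATUS_NOT_INITIALIZED, if I recall correctly, but it has been a busy few days.)

EDIT: Yes to cuda and cudnn. I installed nvidia-390 without trouble, and nvidia-smi working seems like a good test of that. I just finished compiling tf from scratch, and it still fails (in this case importing tf doesn't fail, but the same initialization error still occurs; maybe this isn't the nvidia version it expects, and I think I have nvidia-390.77). I'm considering a fresh 18.04 install with an earlier nvidia-3xx version. Trying to "downgrade" left apt broken and took multiple days to repair.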

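EDIT2: I also realized that I had installed CUDA 9.0 but had installed the cudnn 7.1 build for CUDA 9.1 (you can download it from nvidia, whatever that means). I'm trying to revert, but I'm having a lot of trouble backing it out, and I'm close to wiping and reinstalling Ubuntu and going from there. I have all the commands and think that might be easier, but I'm not sure it will fix the problem. (e.g. cudnn-9.0-linux-x64-v7.1)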
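
A mismatch like this can be caught before unpacking anything, because the cuDNN archive name carries the CUDA version it was built for. The sketch below assumes archive names follow the `cudnn-<cuda>-linux-x64-v<cudnn>` pattern seen in this post; the function names and the 9.1 example name are illustrative.

```python
import re


def parse_cudnn_archive(name):
    """Return (cuda_version, cudnn_version) from a name like cudnn-9.0-linux-x64-v7.1."""
    match = re.match(r"cudnn-(\d+\.\d+)-linux-x64-v(\d+(?:\.\d+)*)", name)
    if match is None:
        raise ValueError("unrecognized cuDNN archive name: %r" % name)
    return match.group(1), match.group(2)


def matches_toolkit(archive_name, toolkit_version):
    """True when the archive was built for the installed CUDA toolkit version."""
    cuda_version, _ = parse_cudnn_archive(archive_name)
    return cuda_version == toolkit_version


print(parse_cudnn_archive("cudnn-9.0-linux-x64-v7.1"))
# Hypothetical name for a CUDA 9.1 build, checked against a 9.0 toolkit.
print(matches_toolkit("cudnn-9.1-linux-x64-v7.1.tgz", "9.0"))
```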

EDIT3: Coming back to respond to this. I wrote up the basics of getting my GPU working on my host under Ubuntu 16.04, but I haven't tested it in Docker. Here is the gist:

https://gist.github.com/onaclov2000/c22fe1456ffa7da6cebd67600003dffb

Copy-pasted here:

# 1070 Ti
Fresh Install 16.04
(download updates, and include 3rd party)
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install nvidia-384
# Blacklist the nouveau driver so it does not conflict with the NVIDIA driver
sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf << 'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u
sudo reboot
# Takes about 30-40 minutes 1.5GB approx
wget https://developer.download.nvidia.com/compute/cuda/9.0/secure/Prod/local_installers/cuda_9.0.176_384.81_linux.run
sudo sh cuda_9.0.176_384.81_linux.run
    No to install nvidia accelerated Graphics Driver for Linux
    yes to Cuda 9.0 toolkit
    default
    yes to symbolic link
    yes to samples
    default location is fine


#Alternately (need to test)
#sudo sh cuda_9.0.176_384.81_linux.run --silent --toolkit --samples

cat >> ~/.bashrc << 'EOF'
export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64\
${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
EOF
cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery
make
./deviceQuery # Assuming make was successful
cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/bandwidthTest
make
./bandwidthTest # Assuming make was successful
# Look for Result = PASS

sudo apt-get install nvidia-cuda-toolkit

# Couldn't find on 16.04 maybe this is a 18.04 upgrade?
#sudo apt-get install cuda-toolkit-9.0 cuda-command-line-tools-9-0

# At this point the driver and CUDA are installed; next, install the cuDNN library.
# This is the link that I have. Be sure to use v7, not v7.1, as I haven't had luck with v7.1 in the past (though it might work).
https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v7.0.5/prod/9.0_20171129/cudnn-9.0-linux-x64-v7
# 333 MB so will take a bit
cd ~/Downloads
tar -xvf cudnn-9.0-linux-x64-v7.tgz
cd cuda
# -P keeps the libcudnn symlinks as symlinks instead of copying the library several times
sudo cp -P lib64/* /usr/local/cuda/lib64/
sudo cp include/* /usr/local/cuda/include/

sudo apt-get install git tmux
cd ~/Downloads
# At this point I'm going to install Anaconda
wget https://repo.continuum.io/archive/Anaconda3-4.3.1-Linux-x86_64.sh -O anaconda-install.sh 
bash anaconda-install.sh # Follow Prompts adding path to bash
source ~/.bashrc
# Include Python so the env gets its own pip (the 3.6 pin is an assumption)
conda create --name ml python=3.6
source activate ml
pip install tensorflow-gpu==1.5

# test the install
cd ~
mkdir projects
cd projects
git clone https://github.com/tensorflow/models
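
Since the failure shows up at conv2d, a single tiny convolution is a quicker check than a full model from the cloned repository. This is a minimal sketch assuming the TensorFlow 1.x API (1.5 as installed above); the function name is made up. The `allow_growth` setting is a commonly reported workaround for cuDNN handle creation failures, not a confirmed fix for this machine.

```python
def conv2d_smoke_test():
    """Run one tiny convolution on the default device and return the output shape."""
    import tensorflow as tf  # imported lazily; assumes the TF 1.x API

    config = tf.ConfigProto()
    # Assumption: letting GPU memory grow on demand sometimes avoids
    # CUDNN_STATUS_NOT_INITIALIZED; treat it as something to try.
    config.gpu_options.allow_growth = True

    images = tf.random_normal([1, 8, 8, 1])  # NHWC: one 8x8 single-channel image
    kernel = tf.random_normal([3, 3, 1, 1])  # 3x3 filter, 1 input / 1 output channel
    output = tf.nn.conv2d(images, kernel, strides=[1, 1, 1, 1], padding="SAME")

    with tf.Session(config=config) as session:
        return session.run(output).shape
```

Calling `conv2d_smoke_test()` inside the `ml` environment either returns the output shape or raises the same cuDNN error, which keeps the reproduction to a few lines.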




# Additional notes
Run a sample from the cuda samples folder

cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery
make
./deviceQuery

Output:

Plenty but ends with the following
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 2
Result = PASS


This tells you which cudnn is installed

cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

Outputs:
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 4
--
#define CUDNN_VERSION    (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)


# This tells you which CUDA toolkit version is installed

nvcc --version 

Outputs:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
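
The two version checks above can be combined into one script. This is a sketch under stated assumptions: it parses text in the format shown above, the function names are illustrative, and the compatibility rule reflects that the prebuilt TensorFlow 1.5 binaries were built against CUDA 9 and cuDNN 7.

```python
import re


def cudnn_version_from_header(header_text):
    """Extract (major, minor, patch) from the CUDNN_* defines in cudnn.h."""
    found = dict(re.findall(
        r"#define\s+CUDNN_(MAJOR|MINOR|PATCHLEVEL)\s+(\d+)", header_text))
    return tuple(int(found[key]) for key in ("MAJOR", "MINOR", "PATCHLEVEL"))


def cuda_release_from_nvcc(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def suits_tensorflow_1_5(cuda_release, cudnn_version):
    """Assumed rule for the prebuilt tensorflow-gpu 1.5 wheel: CUDA 9.0, cuDNN 7.x."""
    return cuda_release == (9, 0) and cudnn_version[0] == 7


# Sample text copied from the outputs shown above.
header = """#define CUDNN_MAJOR 7
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 4
"""
nvcc = "Cuda compilation tools, release 9.0, V9.0.176"
print(suits_tensorflow_1_5(cuda_release_from_nvcc(nvcc),
                           cudnn_version_from_header(header)))
```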

In the end I upgraded to 18.04 but haven't chased all of this down again, so as I go I'll update the gist above with an 18.04 version.

0 answers:

No answers