Question

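I have a cluster with 8 GPUs on which I want to run a Python script. I know the script itself is fine, because it runs on a single-GPU cluster. However, when I try to run it on this 8-GPU cluster, I get the following error message: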

to use: AVX2 AVX512F FMA
2018-03-29 18:42:51.800702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:3d:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:52.347624: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:3e:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:52.882324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:60:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:53.591909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:61:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:54.149671: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 4 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:b1:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:54.715701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 5 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:b2:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:55.286011: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 6 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:da:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:55.874676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 7 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:db:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:55.929779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1227] Device peer to peer matrix
2018-03-29 18:42:55.930506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1233] DMA: 0 1 2 3 4 5 6 7
2018-03-29 18:42:55.930524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 0:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 1:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 2:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930550: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 3:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 4:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 5:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 6:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 7:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2018-03-29 18:43:00.106517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10415 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:3d:00.0, compute capability: 6.1)
2018-03-29 18:43:00.572522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10415 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:3e:00.0, compute capability: 6.1)
2018-03-29 18:43:01.039866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10415 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:60:00.0, compute capability: 6.1)
2018-03-29 18:43:01.512332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10415 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:61:00.0, compute capability: 6.1)
2018-03-29 18:43:02.036327: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 10415 MB memory) -> physical GPU (device: 4, name: GeForce GTX 1080 Ti, pci bus id: 0000:b1:00.0, compute capability: 6.1)
2018-03-29 18:43:02.679167: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 10415 MB memory) -> physical GPU (device: 5, name: GeForce GTX 1080 Ti, pci bus id: 0000:b2:00.0, compute capability: 6.1) 
killed

It simply says "killed", and I'm not sure why this error occurs. I tried restricting the run to two GPUs with the following command:

CUDA_VISIBLE_DEVICES=0,1 python3 my_script.py

But that printed the following error:

Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10415 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:3e:00.0, compute capability: 6.1)
2018-03-29 18:47:46.208490: E tensorflow/stream_executor/cuda/cuda_dnn.cc:378] Loaded runtime CuDNN library: 7102 (compatibility version 7100) but source was compiled with 7004 (compatibility version 7000).  If using a binary install, upgrade your CuDNN library to match.  If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
2018-03-29 18:47:46.210296: F tensorflow/core/kernels/conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
Aborted (core dumped)

I installed tensorflow-gpu with the following commands:

pip3 install tensorflow-gpu 
pip3 install --upgrade tensorflow-gpu

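Could this be related to "activating" tensorflow? I'm not sure how to do that on a cluster, since I'm not sure whether this counts as a virtual environment.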

Answer 1

You need to downgrade your cuDNN version. I solved this problem with 7.0.5.
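To see why a downgrade helps, the two version codes in the error log (7102 loaded at runtime, 7004 at compile time) can be decoded. The sketch below assumes the encoding used by cuDNN 7.x headers, MAJOR * 1000 + MINOR * 100 + PATCHLEVEL; the function name `decode_cudnn_version` is made up for illustration.

```python
def decode_cudnn_version(code):
    """Split a cuDNN 7.x integer version code into (major, minor, patch).

    Assumes code == major * 1000 + minor * 100 + patch.
    """
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


# Version codes copied from the error message in the question.
loaded = decode_cudnn_version(7102)    # library found at runtime
compiled = decode_cudnn_version(7004)  # version TensorFlow was built against

print("loaded at runtime:", ".".join(map(str, loaded)))
print("compiled against: ", ".".join(map(str, compiled)))
print("same major.minor: ", loaded[:2] == compiled[:2])
```

Under that assumption the runtime library is 7.1.x while the TensorFlow binary expects 7.0.x, which is why installing 7.0.5 resolves the mismatch.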

Download cuDNN v7.0.5 (Dec 5, 2017), for CUDA 9.0

Download the .tar file listed under "cuDNN v7.0.5 Library for Linux".

(On Ubuntu 16)

First, you need to remove all existing cuDNN files:

sudo rm -rf /usr/local/cuda/include/cudnn.h
sudo rm -rf /usr/local/cuda/lib64/libcudnn*

Now extract the new cuDNN from the downloaded file:

tar xvzf cudnn-9.0-linux-x64-v7.tgz

Copy the new files into the cuda directory:

sudo cp -P cuda/include/cudnn.h /usr/local/cuda/include    
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64

Set the permissions on these files:

sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
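After copying the files, it may be worth confirming which cuDNN version is actually installed. The following is a minimal sketch, assuming the header defines `CUDNN_MAJOR`, `CUDNN_MINOR`, and `CUDNN_PATCHLEVEL` as cuDNN 7.x headers do; the helper name `parse_cudnn_header` is hypothetical, and the path is the one used in the commands above.

```python
import os
import re

# Header location used in the copy commands above.
HEADER = "/usr/local/cuda/include/cudnn.h"


def parse_cudnn_header(text):
    """Return (major, minor, patch) from cudnn.h contents, or None if absent."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(
            r"^\s*#\s*define\s+" + name + r"\s+(\d+)", text, re.MULTILINE
        )
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    if os.path.exists(HEADER):
        with open(HEADER, errors="replace") as f:
            print("installed cuDNN version:", parse_cudnn_header(f.read()))
    else:
        print("header not found:", HEADER)
```

If the downgrade worked, the reported major and minor numbers should match the version the TensorFlow binary was compiled against (7.0 in the log above).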
