Question

I am working with the following setup:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:05:00.0 Off |                    0 |
| N/A   62C    P0   101W / 149W |  10912MiB / 11439MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:06:00.0 Off |                    0 |
| N/A   39C    P0    72W / 149W |  10919MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:84:00.0 Off |                    0 |
| N/A   50C    P0    57W / 149W |  10919MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:85:00.0 Off |                    0 |
| N/A   42C    P0    69W / 149W |  10919MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

Using Python 3.6 and CUDA 8:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

cuDNN 5.1.10:

#define CUDNN_MAJOR      5
#define CUDNN_MINOR      1
#define CUDNN_PATCHLEVEL 10
--
#define CUDNN_VERSION    (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
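
As a quick sanity check, the version macro above can be evaluated by hand: 5 * 1000 + 1 * 100 + 10 = 5110. The sketch below does the same from the header text; the helper name `cudnn_version` and the `HEADER_TEXT` constant are made up for illustration, and the text is simply the three defines quoted above rather than a file read from disk.

```python
import re

# The three defines quoted above, pasted in as plain text.
HEADER_TEXT = """
#define CUDNN_MAJOR      5
#define CUDNN_MINOR      1
#define CUDNN_PATCHLEVEL 10
"""


def cudnn_version(header_text):
    """Combine the three macros the same way the CUDNN_VERSION define does."""
    parts = {}
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(r"#define\s+CUDNN_%s\s+(\d+)" % name, header_text)
        if match is None:
            raise ValueError("CUDNN_%s not found" % name)
        parts[name] = int(match.group(1))
    return parts["MAJOR"] * 1000 + parts["MINOR"] * 100 + parts["PATCHLEVEL"]


print(cudnn_version(HEADER_TEXT))  # 5110
```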

I want to run Keras with the TensorFlow backend on GPU #1. Given my CUDA/cuDNN versions, I understand that I have to install tensorflow-gpu 1.2 and keras 2.0.5 (for compatibility, see here and here).

First, I create a virtual environment like this:

conda create -n keras
source activate keras
conda install keras=2.0.5 tensorflow-gpu=1.2

Then, if I test the whole setup with the following script:

import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="1"
import keras
model = keras.models.Sequential()
model.add(keras.layers.Dense(1,input_dim=1))
model.compile(loss="mse",optimizer="adam")
import numpy as np
model.fit(np.arange(12).reshape(-1,1),np.arange(12))

I get the following error:

Epoch 1/10
2018-12-13 15:20:42.971806: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-12-13 15:20:42.971827: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-12-13 15:20:42.971833: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-12-13 15:20:42.971838: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-12-13 15:20:42.971843: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-12-13 15:20:42.996052: E tensorflow/core/common_runtime/direct_session.cc:138] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE

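The log shows that it tries to create the session on device 0, which is already in use, as the nvidia-smi output shows. However, I specified device 1 in the script.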
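
One detail that may matter when reading this log: CUDA renumbers the devices listed in CUDA_VISIBLE_DEVICES starting from 0 inside the process, so "device ordinal 0" in the error can refer to physical GPU 1 rather than to the GPU shown as 0 by nvidia-smi. A minimal sketch of that mapping, assuming the variable contains only comma-separated integer indices and that CUDA_DEVICE_ORDER is PCI_BUS_ID as in the script above (the helper name `visible_to_physical` is hypothetical, not part of any library):

```python
def visible_to_physical(cuda_visible_devices):
    """Map in-process CUDA ordinals to the physical indices named in the variable.

    Assumes integer indices only; GPU UUIDs are not handled in this sketch.
    """
    indices = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    return dict(enumerate(indices))


# With the setting used in the test script, ordinal 0 is physical GPU 1.
print(visible_to_physical("1"))  # {0: 1}
```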

Do you have any idea what could be going wrong here?

Apologies if this question is out of place, but I have been struggling with this for several days and cannot seem to make any progress.

Answer 1

I am answering my own question since I solved the problem.

There were actually two problems:

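First, installing tensorflow-gpu=1.2 pulled in cuDNN 6.0, whereas I have cuDNN 5.1.10 installed. The solution is to pin the cuDNN version when installing the packages:

conda install keras=2.0.5 tensorflow-gpu=1.2 cudnn=5.1.10

Second, and this was the real problem, some of my old processes were still running in the background. Although they were not listed in the nvidia-smi panel, they were still holding the GPUs, so my test could not access them. Killing these processes with the kill command solved the problem.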
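
For anyone trying to find such leftover processes when nvidia-smi does not list them, one possible approach is to look for processes that still have the NVIDIA device files open. The sketch below is only an illustration under stated assumptions: a Linux /proc filesystem, device files named /dev/nvidia*, and a hypothetical helper called `pids_holding_device`. It is not what I actually ran, and without root it will only see your own processes.

```python
import os


def pids_holding_device(prefix="/dev/nvidia", proc_root="/proc"):
    """Return sorted PIDs with an open file descriptor whose target starts with prefix."""
    if not os.path.isdir(proc_root):
        return []  # no /proc-style directory available (e.g. not Linux)
    holders = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited meanwhile, or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target.startswith(prefix):
                holders.add(int(entry))
                break
    return sorted(holders)


if __name__ == "__main__":
    # Print candidate PIDs; inspect each one before deciding to kill it.
    for pid in pids_holding_device():
        print(pid)
```

Each PID printed can then be checked (for example with ps) before it is terminated with kill.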

I hope these insights help others who are struggling as I was.

CUDA_ERROR_INVALID_DEVICE with keras=2.0.5 and tensorflow-gpu=1.2.1