我有以下测试代码:
import tensorflow as tf
import numpy as np
def body(x):
a = tf.random_uniform(shape=[2, 2], dtype=tf.int32, maxval=100)
b = tf.constant(np.array([[1, 2], [3, 4]]), dtype=tf.int32)
c = a + b
return tf.nn.relu(x + c)
def condition(x):
return tf.reduce_sum(x) < 100
x = tf.Variable(tf.constant(0, shape=[2, 2]))
with tf.Session():
tf.initialize_all_variables().run()
result = tf.while_loop(condition, body, [x])
print(result.eval())
当我在GPU群集上运行它时,我产生以下错误:
2018-03-30 18:10:33.473913: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10415 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:3d:00.0, compute capability: 6.1)
2018-03-30 18:10:33.591203: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10415 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:3e:00.0, compute capability: 6.1)
2018-03-30 18:10:33.688390: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10415 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:60:00.0, compute capability: 6.1)
2018-03-30 18:10:33.806845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10415 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:61:00.0, compute capability: 6.1)
2018-03-30 18:10:33.913200: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 10415 MB memory) -> physical GPU (device: 4, name: GeForce GTX 1080 Ti, pci bus id: 0000:b1:00.0, compute capability: 6.1)
2018-03-30 18:10:34.018533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 10415 MB memory) -> physical GPU (device: 5, name: GeForce GTX 1080 Ti, pci bus id: 0000:b2:00.0, compute capability: 6.1)
Killed
当我使用CUDA_VISIBLE_DEVICES='6' python script.py
运行脚本时,它会使用GPU中止。可能是什么导致了这个?这可能是一个有缺陷的GPU吗?
nvidia-smi
报告以下内容:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25 Driver Version: 390.25 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:3D:00.0 Off | N/A |
| 28% 21C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:3E:00.0 Off | N/A |
| 28% 21C P8 7W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:60:00.0 Off | N/A |
| 28% 24C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:61:00.0 Off | N/A |
| 28% 25C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:B1:00.0 Off | N/A |
| 28% 19C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:B2:00.0 Off | N/A |
| 28% 20C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:DA:00.0 Off | N/A |
| 28% 22C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:DB:00.0 Off | N/A |
| 28% 21C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Tensorflow版本为1.7.0
,CUDA版本为9.0.176
答案 0 :(得分:0)
问题是创建作业以使用那么多GPU时,我没有请求足够的RAM空间。要使用8个GPU,您需要足够的空间,也许大约60 Gi。