So I have been trying to use multiple GPUs with Keras. When I run the example program from training_utils.py (given as a comment in the training_utils.py source), I end up with a ResourceExhaustedError, and nvidia-smi tells me that only one of the four GPUs is actually working. Training on a single GPU works fine for other programs.
Question: does anyone have an idea what is going on here?
Console output:
(...)
2017-10-26 14:39:02.086838: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************X
2017-10-26 14:39:02.086857: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[128,55,55,256]
Traceback (most recent call last):
  File "test.py", line 27, in <module>
    parallel_model.fit(x, y, epochs=20, batch_size=256)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1631, in fit
    validation_steps=validation_steps)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1213, in _fit_loop
    outs = f(ins_batch)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2331, in __call__
    **self.session_kwargs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,55,55,256]
         [[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
         [[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]

Caused by op u'replica_1/xception/block3_sepconv2/separable_conv2d', defined at:
  File "test.py", line 19, in <module>
    parallel_model = multi_gpu_model(model, gpus=2)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/utils/training_utils.py", line 143, in multi_gpu_model
    outputs = model(inputs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 603, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2061, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2212, in run_internal_graph
    output_tensors = _to_list(layer.call(computed_tensor, **kwargs))
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/layers/convolutional.py", line 1221, in call
    dilation_rate=self.dilation_rate)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 3279, in separable_conv2d
    data_format=tf_data_format)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_impl.py", line 497, in separable_conv2d
    name=name)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 397, in conv2d
    data_format=data_format, name=name)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[128,55,55,256]
         [[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
         [[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
EDIT (example code added):
import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model
import numpy as np

num_samples = 1000
height = 224
width = 224
num_classes = 100

# Instantiate the base model on the CPU so its weights live in host memory.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

# Replicate the model on 4 GPUs; each batch gets split across the replicas.
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')

# Generate dummy data and train.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))
parallel_model.fit(x, y, epochs=20, batch_size=128)
Answer 0 (score: 2)
When you run into an OOM / ResourceExhaustedError on a GPU, I believe changing (reducing) the batch size is the right thing to try first.
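As a minimal sketch (reusing the variables from the example code above, nothing else changed), that just means lowering the value passed to fit:

# Halve the batch size: with 4 replicas this is only 16 samples per GPU,
# which needs far less activation memory per replica than 128 did.
parallel_model.fit(x, y, epochs=20, batch_size=64)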
Depending on how much memory each GPU has, you may need a different batch size for different GPUs.
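If you want to see how much memory is actually free on each card before picking a number, something like the following usually works (it assumes nvidia-smi is on the PATH; double-check the query flags against your driver version):

import subprocess
# Ask nvidia-smi for the free memory (in MiB) of every visible GPU.
out = subprocess.check_output(
    ['nvidia-smi', '--query-gpu=memory.free', '--format=csv,noheader,nounits'])
free_mib = [int(v) for v in out.decode().split()]
print(free_mib)  # one value per GPU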
I ran into a similar problem recently and did a lot of tweaking and many different kinds of experiments.
Here is the link to the question (some tricks are included there as well).
However, keep in mind that reducing the batch size may make training slower.
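One more thing worth keeping in mind: multi_gpu_model splits each batch across the replicas, so the memory pressure on a single GPU is driven by batch_size / gpus rather than batch_size itself (you can see this in your traceback, where batch_size=256 on 2 GPUs shows up as a tensor with a leading dimension of 128). A rough rule of thumb, with purely illustrative numbers:

gpus = 4
single_gpu_batch = 32                  # largest batch known to fit on one GPU
batch_size = single_gpu_batch * gpus   # 128 in total, ~32 samples per replica
parallel_model.fit(x, y, epochs=20, batch_size=batch_size)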