Keras multi-GPU example gives ResourceExhaustedError

Time: 2017-10-26 12:57:53

Tags: tensorflow keras multi-gpu

So I am trying to use multiple GPUs with Keras. When I run the example program from training_utils.py (given as a comment in the training_utils.py code), I end up with a ResourceExhaustedError. nvidia-smi tells me that only one of the four GPUs is working. Using a single GPU works fine for other programs.

  • TensorFlow 1.3.0
  • Keras 2.0.8
  • Ubuntu 16.04
  • CUDA / cuDNN 8.0 / 6.0

Question: Does anyone know what is going on here?

Console output:

(...)


2017-10-26 14:39:02.086838: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************x
2017-10-26 14:39:02.086857: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[128,55,55,256]
Traceback (most recent call last):
  File "test.py", line 27, in <module>
    parallel_model.fit(x, y, epochs=20, batch_size=256)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1631, in fit
    validation_steps=validation_steps)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1213, in _fit_loop
    outs = f(ins_batch)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2331, in __call__
    **self.session_kwargs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,55,55,256]
  [[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
  [[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]


Caused by op u'replica_1/xception/block3_sepconv2/separable_conv2d', defined at:
  File "test.py", line 19, in <module>
    parallel_model = multi_gpu_model(model, gpus=2)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/utils/training_utils.py", line 143, in multi_gpu_model
    outputs = model(inputs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 603, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2061, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2212, in run_internal_graph
    output_tensors = _to_list(layer.call(computed_tensor, **kwargs))
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/layers/convolutional.py", line 1221, in call
    dilation_rate=self.dilation_rate)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 3279, in separable_conv2d
    data_format=tf_data_format)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_impl.py", line 497, in separable_conv2d
    name=name)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 397, in conv2d
    data_format=data_format, name=name)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access


ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[128,55,55,256]
  [[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
  [[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]

Edit (example code added):

import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model
import numpy as np

num_samples = 1000
height = 224
width = 224
num_classes = 100

with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
                   optimizer='rmsprop')

x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))

parallel_model.fit(x, y, epochs=20, batch_size=128)
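
For context (not part of the original post): multi_gpu_model splits each training batch evenly across the replicas, so the leading dimension of the tensor shape in the OOM message is the per-GPU slice of the batch, not the full batch. A minimal sketch of that arithmetic, using the values that appear in the traceback above:

# Minimal sketch (not from the original post): multi_gpu_model slices each batch
# evenly across the replicas, so each GPU only ever holds batch_size // gpus samples.
batch_size = 256          # value in the run that produced the traceback (test.py, line 27)
gpus = 2                  # value in that run (test.py, line 19)
per_gpu_batch = batch_size // gpus
print(per_gpu_batch)      # 128 -- matches the leading dimension of shape [128, 55, 55, 256]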

1 Answer:

Answer 0: (score: 2)

When you run into an OOM / ResourceExhaustedError on the GPU, I believe changing (reducing) the batch size is the right thing to try first.
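
As a minimal illustration (this snippet is mine, not the original answerer's), lowering batch_size in the fit call from the question would look like the following; 32 is just a starting point to probe what fits in memory:

# A minimal sketch, assuming the model and data from the question above.
# A smaller batch shrinks every activation tensor each replica has to hold,
# which is usually enough to get past the ResourceExhaustedError.
parallel_model.fit(x, y, epochs=20, batch_size=32)  # was 128 in the posted code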


For different GPUs you may need a different batch size based on the GPU memory you have.
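
One way to act on that, sketched here as an assumption on my part rather than something from the answer, is to check how much memory each GPU reports as free (for example via nvidia-smi) and pick the batch size from the smallest value:

# Hypothetical helper (not from the original post): query free GPU memory through
# nvidia-smi and choose a conservative batch size from it.
import subprocess

def free_gpu_memory_mib():
    """Return the free memory in MiB that nvidia-smi reports for each GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"])
    return [int(line) for line in out.decode().strip().splitlines()]

# Illustrative thresholds only -- tune them for your own model and cards.
smallest_free = min(free_gpu_memory_mib())
batch_size = 64 if smallest_free > 8000 else 32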

Recently I faced a similar problem and tweaked a lot while running different kinds of experiments.

Here is the link to the question (some tricks are included there as well).

However, while reducing the batch size you may find that your training becomes slower.