Tensorflow MultiGPU coredump

时间:2016-12-14 09:21:06

标签: python tensorflow tensorboard panda3d

我正在尝试使用Tensorflow进行塔式多gpu培训。我大致都在关注cifar10 multigpu教程https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py

所以,我已经确定了范围。作为参考,我可以在这里找到构建resnet的整个课程:https://gist.github.com/mpkuse/6f9dcd419effa707422eb2c5097f51b4

我基本上将所有训练变量都保存在cpu上,并对我拥有的2个GPU中的每一个进行推断和成本计算。我做了渐变的平均值。

Tensorflow Init

puf_obj = puf.PlutoFlow(trainable_on_device='/cpu:0')

    # #multigpu - SGD Optimizer on cpu
    with tf.device( '/cpu:0' ):
        self.tf_learning_rate = tf.placeholder( 'float', shape=[], name='learning_rate' )
        self.tensorflow_optimizer = tf.train.AdamOptimizer( self.tf_learning_rate )



    # Define Deep Residual Nets
    # #multigpu - have the infer computation on each gpu (with different batches)
    self.tf_tower_cost = []
    self.tf_tower_infer = []
    tower_grad = []
    self.tf_tower_ph_x = []
    self.tf_tower_ph_label_x = []
    self.tf_tower_ph_label_y = []
    self.tf_tower_ph_label_z = []
    self.tf_tower_ph_label_yaw = []

    for gpu_id in [0,1]:
        with tf.device( '/gpu:'+str(gpu_id) ):
            with tf.name_scope( 'tower_'+str(gpu_id) ):
                # have placeholder `x`, label_x, label_y, label_z, label_yaw
                tf_x = tf.placeholder( 'float', [None,240,320,3], name='x' )
                tf_label_x = tf.placeholder( 'float',   [None,1], name='label_x')
                tf_label_y = tf.placeholder( 'float',   [None,1], name='label_y')
                tf_label_z = tf.placeholder( 'float',   [None,1], name='label_z')
                tf_label_yaw = tf.placeholder( 'float', [None,1], name='label_yaw')

                # infer op
                tf_infer_op = puf_obj.resnet50_inference(tf_x, is_training=True)  # Define these inference ops on all the GPUs

                # Cost
                with tf.variable_scope( 'loss'):
                    gpu_cost = self.define_l2_loss( tf_infer_op, tf_label_x, tf_label_y, tf_label_z, tf_label_yaw )

                # self._print_trainable_variables()


                # Gradient computation op
                # following grad variable contain a list of 2 elements each
                # ie. ( (grad_v0_gpu0,var0_gpu0),(grad_v1_gpu0,var1_gpu0) ....(grad_vN_gpu0,varN_gpu0) )
                tf_grad_compute = self.tensorflow_optimizer.compute_gradients( gpu_cost )


                # Make list of tower_cost, gradient, and placeholders
                self.tf_tower_cost.append( gpu_cost )
                self.tf_tower_infer.append( tf_infer_op )
                tower_grad.append( tf_grad_compute )

                self.tf_tower_ph_x.append(tf_x)
                self.tf_tower_ph_label_x.append(tf_label_x)
                self.tf_tower_ph_label_y.append(tf_label_y)
                self.tf_tower_ph_label_z.append(tf_label_z)
                self.tf_tower_ph_label_yaw.append(tf_label_yaw)


    self._print_trainable_variables()


    # Average Gradients (gradient_gpu0 + gradient_gpu1 + ...)
    with tf.device( '/gpu:0'):
        n_gpus = len( tower_grad )
        n_trainable_variables = len(tower_grad[0] )
        tf_avg_gradient = []

        for i in range( n_trainable_variables ): #loop over trainable variables
            t_var = tower_grad[0][i][1]

            t0_grad = tower_grad[0][i][0]
            t1_grad = tower_grad[1][i][0]

            # ti_grad = [] #get Gradients from each gpus
            # for gpu_ in range( n_gpus ):
            #     ti_grad.append( tower_grad[gpu_][i][0] )
            #
            # grad_total = tf.add_n( ti_grad, name='gradient_adder' )

            grad_total = tf.add( t0_grad, t1_grad )

            frac = 1.0 / float(n_gpus)
            t_avg_grad = tf.mul( grad_total ,  frac, name='gradi_scaling' )

            tf_avg_gradient.append( (t_avg_grad, t_var) )



    with tf.device( '/cpu:0' ):
        # Have the averaged gradients from all GPUS here as arg for apply_grad()
        self.tensorflow_apply_grad = self.tensorflow_optimizer.apply_gradients( tf_avg_gradient )

因为我遇到了梯度下降的问题。我试图用随机初始化来评估成本。我将会话运行为:

迭代中的代码

pp,qq = self.tensorflow_session.run( [self.tf_learning_rate,self.tf_tower_cost[1] ], \
                    feed_dict={self.tf_learning_rate:lr, \
                            self.tf_tower_ph_x[0]:im_batch[0:10,:,:,:],\
                                    self.tf_tower_ph_label_x[1]:label_batch[0:10,0:1], \
                                    self.tf_tower_ph_label_y[1]:label_batch[0:10,1:2], \
                                    self.tf_tower_ph_label_z[1]:label_batch[0:10,2:3], \
                                    self.tf_tower_ph_label_yaw[1]:label_batch[0:10,3:4], \
                            self.tf_tower_ph_x[1]:im_batch[10:20,:,:,:],\
                    } )

当我仅在GPU0上进行计算时,它会起作用! 虽然如果我在GPU1上进行计算,那么它就是coredumps。我有另一个cifar10 multi-gpu的例子,它完美无缺。所以我很确定我的GPU很好。我还要提一下,我的feed_data是用Panda3D(图形工具包)生成的。数据在任务中生成(在panda3d语言中)。

这会影响张量流吗?将此渲染数据用于数据馈送的最佳方法是什么?

E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:532] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms) 

我也遇到了这个错误:

E tensorflow/stream_executor/cuda/cuda_blas.cc:372] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
W tensorflow/stream_executor/stream.cc:1390] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "train_tf_decop.py", line 337, in <module>
    _, pp,qq = tensorflow_session.run( [tensorflow_apply_grad, tf_tower_cost[0], tf_tower_cost[1]], feed_dict=feed )
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas SGEMM launch failed : m=48000, n=64, k=64
     [[Node: tower_1/trainable_vars/res2a/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](tower_1/trainable_vars/MaxPool, trainable_vars/res2a/wc1/read/_503)]]
     [[Node: gradi_scaling_100/_3371 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10972_gradi_scaling_100", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op u'tower_1/trainable_vars/res2a/Conv2D', defined at:
  File "train_tf_decop.py", line 217, in <module>
    tf_infer_op = puf_obj.resnet50_inference(tf_x, is_training=True)  # Define these inference ops on all the GPUs
  File "/home/mpkuse/pluto_train/PlutoFlow.py", line 145, in resnet50_inference
    conv_out = self.resnet_unit( input_var, 64, [64,64,256], [1,3,1], is_training=is_training, short_circuit=False )
  File "/home/mpkuse/pluto_train/PlutoFlow.py", line 281, in resnet_unit
    conv_1 = self._conv2d_nobias( input_tensor, wc1, pop_mean=wc1_bn_pop_mean, pop_varn=wc1_bn_pop_varn, is_training=is_training, W_beta=wc1_bn_beta, W_gamma=wc1_bn_gamma )
  File "/home/mpkuse/pluto_train/PlutoFlow.py", line 414, in _conv2d_nobias
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 396, in conv2d
    data_format=data_format, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

InternalError (see above for traceback): Blas SGEMM launch failed : m=48000, n=64, k=64
     [[Node: tower_1/trainable_vars/res2a/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](tower_1/trainable_vars/MaxPool, trainable_vars/res2a/wc1/read/_503)]]
     [[Node: gradi_scaling_100/_3371 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10972_gradi_scaling_100", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

我知道问题2033(https://github.com/tensorflow/tensorflow/issues/2033)。似乎它表明这个问题可能是由于空数组。

我的张量图如下所示。这里的一切看起来都很好。 我在Ubuntu 16.04,tensorflow 0.12,cudnn v5,python 2.7

任何有用的帮助/建议......! enter image description here

0 个答案:

没有答案