Question

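I am trying to modify the cifar10 code, following the Mask R-CNN example, so that it runs on multiple GPUs. Most of the code is shown below.

I read one image and its ground-truth information from a TFRecords file, as follows: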

    image, ih, iw, gt_boxes, gt_masks, num_instances, img_id = \
        datasets.get_dataset(FLAGS.dataset_name,
                            FLAGS.dataset_split_name,
                            FLAGS.dataset_dir,
                            FLAGS.im_batch,
                            is_training=True)

Here the size of image and the value of num_instances vary from picture to picture. These inputs are then stored in a RandomShuffleQueue, as follows:

    data_queue = tf.RandomShuffleQueue(capacity=32, min_after_dequeue=16,
            dtypes=(
                image.dtype, ih.dtype, iw.dtype,
                gt_boxes.dtype, gt_masks.dtype,
                num_instances.dtype, img_id.dtype))

    enqueue_op = data_queue.enqueue((image, ih, iw, gt_boxes, gt_masks, num_instances, img_id))
    data_queue_runner = tf.train.QueueRunner(data_queue, [enqueue_op] * 4)
    tf.add_to_collection(tf.GraphKeys.QUEUE_RUNNERS, data_queue_runner)
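
To make the roles of capacity and min_after_dequeue concrete, here is a minimal framework-free sketch of the queue's documented behaviour. It is not TensorFlow code: ToyShuffleQueue is a hypothetical name, and where the real tf.RandomShuffleQueue would block, this toy raises an error instead.

```python
import random


class ToyShuffleQueue:
    """Toy stand-in for a shuffle queue (hypothetical, not the TensorFlow class)."""

    def __init__(self, capacity, min_after_dequeue, seed=None):
        self.capacity = capacity
        self.min_after_dequeue = min_after_dequeue
        self._items = []
        self._rng = random.Random(seed)

    def size(self):
        return len(self._items)

    def enqueue(self, item):
        # The real queue blocks when full; the toy raises instead.
        if len(self._items) >= self.capacity:
            raise RuntimeError("queue is full")
        self._items.append(item)

    def dequeue(self):
        # At least min_after_dequeue elements must remain after the dequeue,
        # so the queue needs strictly more than that many elements now.
        if len(self._items) <= self.min_after_dequeue:
            raise RuntimeError("not enough elements to keep the queue mixed")
        index = self._rng.randrange(len(self._items))
        return self._items.pop(index)


if __name__ == "__main__":
    queue = ToyShuffleQueue(capacity=32, min_after_dequeue=16, seed=0)
    for example_id in range(20):
        queue.enqueue(example_id)
    print("dequeued example", queue.dequeue(), "remaining:", queue.size())
```

With capacity=32 and min_after_dequeue=16 as in my code, a dequeue can only proceed once at least 17 elements are waiting in the queue.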

I use tower_grads to collect the gradients from each GPU and then average them. The multi-GPU code is below:

    tower_grads = []
    num_gpus = 2
    with tf.variable_scope(tf.get_variable_scope()):
        for i in range(num_gpus):  # range works on both Python 2 and 3
            with tf.device('/gpu:%d' % i):
                with tf.name_scope('tower_%d' % i) as scope:

                    (image, ih, iw, gt_boxes, gt_masks, num_instances, img_id) = data_queue.dequeue()
                    im_shape = tf.shape(image)
                    image = tf.reshape(image, (im_shape[0], im_shape[1], im_shape[2], 3))

                    total_loss = compute_loss() # use tensor from dequeue operation to compute loss

                    grads = compute_grads(total_loss)
                    tower_grads.append(grads)

    grads = average_grads(tower_grads)
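
The body of average_grads is not shown above. As a rough sketch of the arithmetic it is meant to perform, assuming it averages each variable's gradient across towers in the same way as the CIFAR-10 multi-GPU example, here is a framework-free version. The name average_tower_grads and the plain-list gradient format are assumptions made for illustration only; the real function operates on tensors.

```python
def average_tower_grads(tower_grads):
    """Average per-variable gradients across towers.

    tower_grads: one list per tower, each a list of (gradient, variable_name)
    pairs in the same variable order. Gradients are plain lists of floats
    here (an assumption for illustration).
    """
    averaged = []
    # zip(*tower_grads) groups the entries that belong to the same variable.
    for grads_and_vars in zip(*tower_grads):
        grads = [grad for grad, _ in grads_and_vars]
        var_name = grads_and_vars[0][1]  # variables are shared across towers
        num_towers = len(grads)
        mean_grad = [sum(values) / num_towers for values in zip(*grads)]
        averaged.append((mean_grad, var_name))
    return averaged


if __name__ == "__main__":
    tower_0 = [([1.0, 2.0], "w"), ([0.5], "b")]
    tower_1 = [([3.0, 4.0], "w"), ([1.5], "b")]
    print(average_tower_grads([tower_0, tower_1]))
```

The sketch relies on every tower listing the same variables in the same order, which only holds if the towers share variables rather than each creating their own.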

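When num_gpus=1 the code runs fine (I mean without errors), but when I use two TITAN X GPUs, some strange errors appear, such as:

failed to enqueue async memset operation: CUDA_ERROR_INVALID_HANDLE
Internal: Blas GEMM launch failed

The error is also not the same each time I run the code. I can't figure out why these errors occur. Are they caused by the data queue or by using multiple GPUs?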

TensorFlow with multi-GPU and tf.RandomShuffleQueue

0 answers: