Question

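I have looked at the cifar10 multi-GPU implementation for inspiration on parallelizing my own model for GPU training.

My model consumes data from TFRecords, which are iterated over with the tf.data.Iterator class. So, given 2 GPUs, what I am trying to do is call iterator.get_next() on the CPU once per GPU (i.e. twice), do some preprocessing, embedding lookup and other CPU-related work, and then feed the two batches to the GPUs.

Pseudocode: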

with tf.device('/cpu:0'):
    batches = []
    for gpu in multiple_gpus:
        single_gpu_batch = cpu_function(iterator.get_next())
        batches.append(single_gpu_batch)

    ....................

tower_losses = []
for gpu, batch in zip(multiple_gpus, batches):
    with tf.device('/device:GPU:{}'.format(gpu.id)):
        single_gpu_loss = inference_and_loss(batch)
        tower_losses.append(single_gpu_loss)
        ...........
        ...........

total_loss = average_loss(tower_losses)

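The problem is that if only one example (or fewer) is left to draw from the data and I call iterator.get_next() twice, a tf.errors.OutOfRange exception is raised, and the data from the first iterator.get_next() call (which did not actually fail, only the second one did) is never passed through the GPU.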
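The failure mode can be reproduced without TensorFlow. The following is only an analogy, assuming a plain Python generator in place of tf.data.Iterator and StopIteration in place of tf.errors.OutOfRange; make_batches and draw_for_gpus are hypothetical names introduced for illustration.

```python
def make_batches(num_examples, batch_size):
    """Yield batches (lists of example indices), standing in for a batched dataset."""
    for start in range(0, num_examples, batch_size):
        yield list(range(start, min(start + batch_size, num_examples)))


def draw_for_gpus(iterator, num_gpus):
    """Draw one batch per GPU; if any draw fails, the whole step fails."""
    drawn = []
    for _ in range(num_gpus):
        drawn.append(next(iterator))  # raises StopIteration once exhausted
    return drawn


it = make_batches(num_examples=10, batch_size=4)  # batches of size 4, 4, 2
first_step = draw_for_gpus(it, num_gpus=2)        # both draws succeed

try:
    second_step = draw_for_gpus(it, num_gpus=2)   # first draw succeeds, second fails
except StopIteration:
    second_step = None  # the size-2 batch was consumed but never reaches a GPU

leftover = next(it, None)  # the iterator is exhausted, so nothing can be recovered
```

In this sketch the last two examples are drawn by the first call of the second step and then discarded when the second call fails, which mirrors the behavior described above.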

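I have thought about drawing the data in a single iterator.get_next() call and splitting it afterwards, but the split fails when the batch size returned by iterator.get_next() is not divisible by the number of GPUs.

What is the right way to implement iterator consumption in a multi-GPU setup?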

Answer 1

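I think the second suggestion is the easiest way to go. To avoid the split problem on the last batch, you can use the drop_remainder option of dataset.batch; or, if you need to see all the data, one possible solution is to set the split sizes explicitly from the size of the drawn batch, so that the split operation never fails: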

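import tensorflow as tf

# Here multiple_gpus is the number of GPUs (an int) and batch_size is the per-GPU batch size.
dataset = dataset.batch(batch_size * multiple_gpus)
iterator = dataset.make_one_shot_iterator()
batches = iterator.get_next()

# Per-device split sizes, filled in by one of the two solutions below.
split_dims = [0] * multiple_gpus
drawn_batch_size = tf.shape(batches)[0]  # actual size of the batch drawn in this step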

One option is to be greedy and fit batch_size examples on each device until the drawn batch runs out:

# Solution 1 [Greedy]: give each device up to batch_size examples until none remain.
for i in range(multiple_gpus):
  split_dims[i] = tf.maximum(0, tf.minimum(batch_size, drawn_batch_size))
  drawn_batch_size -= batch_size
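To check the greedy arithmetic by hand, here is a minimal plain-Python mirror of the loop, assuming integer inputs instead of tensors; greedy_split_sizes is a hypothetical helper name, not part of TensorFlow.

```python
def greedy_split_sizes(drawn_batch_size, batch_size, num_gpus):
    """Fill each device with up to batch_size examples until none remain."""
    sizes = []
    remaining = drawn_batch_size
    for _ in range(num_gpus):
        # Clamp to [0, batch_size]; remaining may go negative after the last full device.
        sizes.append(max(0, min(batch_size, remaining)))
        remaining -= batch_size
    return sizes


full = greedy_split_sizes(8, batch_size=4, num_gpus=2)   # a full batch
short = greedy_split_sizes(5, batch_size=4, num_gpus=2)  # a short final batch
tiny = greedy_split_sizes(3, batch_size=4, num_gpus=2)   # fewer than batch_size examples
```

Note that with the greedy scheme a device can end up with zero examples (the last case here), so any per-tower mean over the batch would probably need to guard against an empty input.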

Alternatively, spread the examples so that every device is guaranteed at least one sample (assuming multiple_gpus < drawn_batch_size):

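# Solution 2 [Spread]: reserve one example per device, then hand out the rest greedily.
drawn_batch_size -= multiple_gpus
for i in range(multiple_gpus):
  # Up to batch_size - 1 extra examples on top of the one reserved for this device.
  split_dims[i] = tf.maximum(0, tf.minimum(batch_size - 1, drawn_batch_size)) + 1
  drawn_batch_size -= batch_size - 1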

# Split the drawn batch into one piece per device.
batches = tf.split(batches, split_dims)
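The spread variant can be checked the same way. The sketch below is a plain-Python mirror, assuming integer inputs and a list in place of the batch tensor. It reserves one example per device and then hands out at most batch_size - 1 extra examples per device. spread_split_sizes and split_by_sizes are hypothetical helper names.

```python
def spread_split_sizes(drawn_batch_size, batch_size, num_gpus):
    """Give every device at least one example, then fill devices in order."""
    if drawn_batch_size < num_gpus:
        raise ValueError("need at least one example per device")
    remaining = drawn_batch_size - num_gpus  # one example is reserved per device
    sizes = []
    for _ in range(num_gpus):
        extra = max(0, min(batch_size - 1, remaining))
        sizes.append(extra + 1)
        remaining -= batch_size - 1
    return sizes


def split_by_sizes(examples, sizes):
    """List-based stand-in for splitting a batch along axis 0 by explicit sizes."""
    pieces, start = [], 0
    for size in sizes:
        pieces.append(examples[start:start + size])
        start += size
    return pieces


sizes = spread_split_sizes(7, batch_size=4, num_gpus=2)
pieces = split_by_sizes(list(range(7)), sizes)
minimal = spread_split_sizes(2, batch_size=4, num_gpus=2)
```

The sizes always sum to the drawn batch size, so every example lands on exactly one device and no device receives an empty piece.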

tf.data.Iterator with a multi-GPU setup
