Training with a global batch size on a TPU (tensorflow)

Time: 2020-07-09 05:11:51

Tags: tensorflow neural-network tensorflow2.0 tpu batchsize

I recently started a neural network project on Google Colab and discovered that I could use a TPU. I have been looking into how to use it, found TensorFlow's TPUStrategy (I am using tensorflow 2.2.0), and was able to successfully define a model and train it on the TPU.

However, I am not entirely sure what this means. Maybe I have not read Google's TPU guide carefully enough, but what I mean is that I don't know what exactly happens during a training step.

The guide asks you to define a GLOBAL_BATCH_SIZE, and the batch size handled by each TPU core is given by per_replica_batch_size = GLOBAL_BATCH_SIZE / strategy.num_replicas_in_sync, which means the batch size on each core is smaller than the one you started with. On Colab, strategy.num_replicas_in_sync = 8, so if I start with a GLOBAL_BATCH_SIZE of 64, the per_replica_batch_size is 8.
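For concreteness, this is the standard Colab TPU setup I am referring to, as a minimal sketch for TF 2.2 (COLAB_TPU_ADDR is the address the Colab runtime exposes):

```python
import os
import tensorflow as tf

# Standard Colab TPU initialization for TF 2.2; the runtime exposes the
# TPU address through the COLAB_TPU_ADDR environment variable.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

GLOBAL_BATCH_SIZE = 64
per_replica_batch_size = GLOBAL_BATCH_SIZE // strategy.num_replicas_in_sync
print(strategy.num_replicas_in_sync)  # 8 on a Colab TPU v2-8
print(per_replica_batch_size)         # 64 / 8 = 8
```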

Now, what I don't understand is this: when I run one training step, does the optimizer compute 8 different steps on batches of size per_replica_batch_size, updating the model's weights 8 different times, or does it merely parallelize the computation of the training step this way and, in the end, compute only a single optimizer step on a batch of size GLOBAL_BATCH_SIZE? Thank you.

1 Answer:

Answer 0 (score: 0)

This is a very good question, and it is related to Distribution Strategy.

After going through the Tensorflow Documentation, the TPU Strategy Documentation, and the explanation of Synchronous and Asynchronous Training,

I can say that

> the optimizer computes 8 different steps on batches of size
> per_replica_batch_size, updating the weights of the model 8 different
> times

The following explanation from the Tensorflow Documentation should clarify this:

> So, how should the loss be calculated when using a
> tf.distribute.Strategy?
> 
> For an example, let's say you have 4 GPU's and a batch size of 64. One
> batch of input is distributed across the replicas (4 GPUs), each
> replica getting an input of size 16.
> 
> The model on each replica does a forward pass with its respective
> input and calculates the loss. Now, instead of dividing the loss by
> the number of examples in its respective input (BATCH_SIZE_PER_REPLICA
> = 16), the loss should be divided by the GLOBAL_BATCH_SIZE (64).
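In a custom training loop this scaling is usually done by disabling the loss's built-in reduction and calling tf.nn.compute_average_loss with the global batch size. A minimal sketch (the particular loss function here is just an example; GLOBAL_BATCH_SIZE is the value from the question):

```python
import tensorflow as tf

GLOBAL_BATCH_SIZE = 64  # as in the question

# Compute per-example losses on each replica; no implicit averaging.
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True,
    reduction=tf.keras.losses.Reduction.NONE)

def compute_loss(labels, predictions):
    per_example_loss = loss_object(labels, predictions)
    # Divide by the GLOBAL batch size (64), not the per-replica size (8),
    # so that summing the gradients across the 8 replicas is equivalent to
    # one optimizer step on the full global batch.
    return tf.nn.compute_average_loss(
        per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
```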

The explanations from the other links are also provided below (in case the links stop working in the future):

The TPU Strategy documentation states:

> In terms of distributed training architecture, `TPUStrategy` is the
> same as `MirroredStrategy` - it implements `synchronous` distributed
> training. `TPUs` provide their own implementation of efficient
> `all-reduce` and other collective operations across multiple `TPU`
> cores, which are used in `TPUStrategy`.
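In a custom loop, each synchronous step is launched with strategy.run (the TF 2.2 name; earlier 2.x releases call it experimental_run_v2), which executes the same step function once per TPU core on that core's slice of the global batch. A rough sketch, where train_step and the distributed inputs are placeholders defined elsewhere:

```python
import tensorflow as tf

@tf.function
def distributed_train_step(dist_inputs):
    # Run train_step on all 8 TPU cores in lock-step, each core seeing its
    # own per-replica batch (8 examples out of the global 64).
    per_replica_losses = strategy.run(train_step, args=(dist_inputs,))
    # Combine the per-replica losses into a single scalar, e.g. for logging.
    return strategy.reduce(tf.distribute.ReduceOp.SUM,
                           per_replica_losses, axis=None)
```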

The explanation of Synchronous and Asynchronous Training reads:

> `Synchronous vs asynchronous training`: These are two common ways of
> `distributing training` with `data parallelism`. In `sync training`, all
> `workers` train over different slices of input data in `sync`, and
> **`aggregating gradients`** at each step. In `async` training, all workers are
> independently training over the input data and updating variables
> `asynchronously`. Typically sync training is supported via all-reduce
> and `async` through parameter server architecture.
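To make the "aggregating gradients" part concrete, here is a sketch of the step function itself (model, optimizer and compute_loss are assumed to have been created under strategy.scope(), with compute_loss scaled by the global batch size as shown above):

```python
import tensorflow as tf

def train_step(inputs):
    images, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(images, training=True)
        loss = compute_loss(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    # Under synchronous training, apply_gradients first all-reduces (sums)
    # the gradients across the 8 replicas, so every replica applies the
    # identical update to its mirrored copy of the variables.
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
```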

You can also go through this MPI Tutorial to understand the concept of All_Reduce in more detail.

The screenshot below shows how All_Reduce works:

[Screenshot: All_Reduce illustration]
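As a toy illustration of the same idea in code (plain Python, not how TPUs actually implement it): every core contributes its local gradient, the contributions are reduced element-wise, and every core receives the same aggregated result.

```python
# 8 "cores", each holding a local gradient vector of length 2 (made-up numbers).
local_gradients = [[0.1 * i, 0.2 * i] for i in range(8)]

# Reduce step: element-wise sum across all cores.
reduced = [sum(component) for component in zip(*local_gradients)]

# Broadcast step: every core ends up with the same summed gradient.
all_reduced = [list(reduced) for _ in range(8)]
print(reduced)  # [2.8, 5.6] (up to floating-point rounding)
```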