Question

To my surprise, I could not find any resource online on how to dynamically adjust the GPU batch size without halting training.

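The idea is the following:

1) Have a training script that is (almost) agnostic to the GPU in use. The batch size would adjust dynamically, without user intervention or tuning.

2) Still be able to specify the desired training batch size, even if it is too large to fit in the largest known GPU.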

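For example, say I want to train a model with a batch of 4096 images, each 1024x1024. Say I have access to a server with different NVIDIA GPUs, but I don't know in advance which one will be assigned to me. (Or everyone wants the biggest GPU, and I have to wait a long time before my turn comes.)

I want my training script to find the largest batch size that fits (say, 32 images per GPU batch) and to update the optimizer only after all 4096 images have been processed (one training batch = 128 GPU batches).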
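The arithmetic in the example above (4096 / 32 = 128) can be written as a small helper. This is only a sketch: the function name `accumulation_steps` is made up for illustration, and rounding up when the sizes do not divide evenly is an assumption, not something the question specifies.

```python
def accumulation_steps(train_batch_size: int, gpu_batch_size: int) -> int:
    """Number of GPU batches to process before one optimizer update."""
    if train_batch_size <= 0 or gpu_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    # Ceiling division: if the GPU batch does not divide the target evenly,
    # round up so that at least train_batch_size samples are covered.
    return max(1, -(-train_batch_size // gpu_batch_size))


if __name__ == "__main__":
    for gpu_bs in (8, 32, 128):
        print(gpu_bs, accumulation_steps(4096, gpu_bs))
```

With a GPU batch of 32 this gives the 128 GPU batches per training batch mentioned in the question.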

Answer 1

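There are different ways to solve this. But if requesting a specific GPU that can do the job, or using multiple GPUs, is not an option, it is handy to adapt the GPU batch size on the fly.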

I prepared this repo with an illustrative training example in PyTorch (it should work similarly in TensorFlow).

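In the training script below, try/except is used to try different GPU batch sizes without halting training. When the batch becomes too large, it is halved and the adaptation is turned off. Please check the repo for implementation details and possible bug fixes.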

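It also implements a technique called batch spoofing (more commonly known as gradient accumulation), which runs several forward and backward passes before each optimizer update. In PyTorch this only requires moving the optimizer.step() and optimizer.zero_grad() calls so that they run once every few GPU batches.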
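As a side note before the full script, here is a small dependency-free check of why accumulating gradients over small batches can stand in for one large batch. It is a sketch under assumptions: a one-parameter linear model with a mean-squared-error loss, equally sized micro-batches, and each micro-batch gradient scaled by the number of accumulation steps (the PyTorch analogue would be dividing each loss by that number before `backward()`). The helper names are hypothetical.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro):
    """Sum micro-batch gradients, each scaled by 1 / number_of_chunks."""
    chunks = [(xs[i:i + micro], ys[i:i + micro])
              for i in range(0, len(xs), micro)]
    grad = 0.0
    for cx, cy in chunks:
        grad += mse_grad(w, cx, cy) / len(chunks)
    return grad


xs = [float(i) for i in range(8)]
ys = [3.0 * x + 1.0 for x in xs]

full = mse_grad(0.5, xs, ys)                  # one batch of 8 samples
acc = accumulated_grad(0.5, xs, ys, micro=2)  # four accumulated batches of 2

print(abs(full - acc) < 1e-9)
```

Without the scaling, the accumulated gradient would be the sum rather than the mean over micro-batches, which acts like a larger learning rate.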

import torch
import torchvision
import torch.optim as optim
import torch.nn as nn

# Example of how to use it with PyTorch
if __name__ == "__main__":

    # #############################################################
    # 1) Initialize the dataset, model, optimizer and loss as usual.
    # Use the GPU when one is available; otherwise no CUDA OOM can occur.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Initialize a fake dataset. ToTensor is needed so that the default
    # collate function receives tensors instead of PIL images.
    trainset = torchvision.datasets.FakeData(size=1_000_000,
                                             image_size=(3, 224, 224),
                                             num_classes=1000,
                                             transform=torchvision.transforms.ToTensor())

    # Initialize the model, loss and SGD-based optimizer.
    # Note: newer torchvision versions prefer the `weights=` argument
    # over `pretrained=True`.
    resnet = torchvision.models.resnet152(pretrained=True,
                                          progress=True).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(resnet.parameters(), lr=0.01)

    continue_training = True  # criterion to stop the training

    # #############################################################
    # 2) Set parameters for the adaptive batch size
    adapt = True  # while this is true, the algorithm will perform batch adaptation
    gpu_batch_size = 2  # initial GPU batch size, it can be very small
    train_batch_size = 2048  # the desired training batch size

    # Modified training loop to allow for adaptive batch size
    while continue_training:

        # #############################################################
        # 3) Initialize dataloader and batch spoofing parameter
        # The dataloader has to be reinitialized for each new batch size.
        trainloader = torch.utils.data.DataLoader(trainset,
                                                  batch_size=gpu_batch_size,
                                                  shuffle=True)

        # Number of GPU batches per optimizer update (batch spoofing)
        repeat = max(1, train_batch_size // gpu_batch_size)

        try:  # This makes sure training is not halted when the batch size is too large

            # #############################################################
            # 4) Epoch loop with batch spoofing
            optimizer.zero_grad()  # done before training because of batch spoofing.

            for i, (x, y) in enumerate(trainloader):
                x, y = x.to(device), y.to(device)

                y_pred = resnet(x)
                # Scale the loss so the accumulated gradient matches the mean
                # over one large batch instead of a sum over GPU batches.
                loss = criterion(y_pred, y) / repeat
                loss.backward()

                # Batch spoofing: update only after `repeat` GPU batches
                if (i + 1) % repeat == 0:
                    optimizer.step()
                    optimizer.zero_grad()

                # #############################################################
                # 5) Adapt batch size while no RuntimeError is raised.
                # Increase batch size and get out of the loop
                if adapt:
                    if gpu_batch_size >= train_batch_size:
                        adapt = False  # no need to grow past the target size
                    else:
                        gpu_batch_size = min(gpu_batch_size * 2, train_batch_size)
                        break

                # Stopping criterion for training
                if i > 100:
                    continue_training = False
                    break

        # #############################################################
        # 6) After the largest batch size is found, training continues with the fixed batch size.
        # CUDA out of memory is a RuntimeError; it is raised once the batch size is too large.
        except RuntimeError as run_error:
            # Re-raise anything that does not look like an out-of-memory error,
            # or an OOM that cannot be fixed by shrinking the batch further.
            if "out of memory" not in str(run_error).lower() or gpu_batch_size == 1:
                raise

            gpu_batch_size = max(1, gpu_batch_size // 2)  # back to the largest size that fits in memory
            adapt = False  # turn off the batch adaptation

            # Release cached blocks left over from the failed attempt.
            if device.type == "cuda":
                torch.cuda.empty_cache()

            print(f"---\nRuntimeError: \n{run_error}\n---\nFalling back to GPU batch size {gpu_batch_size}")
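The doubling-then-halving search used by the script can also be looked at in isolation. The sketch below is not the repo's code: `find_max_batch_size` and `fake_step` are hypothetical names, `fake_step` only simulates a GPU that fits up to 40 samples, and matching on the text "out of memory" is an assumption about how the error message reads.

```python
def find_max_batch_size(try_batch, start=2, limit=4096):
    """Double the batch size until try_batch raises an OOM-like RuntimeError.

    try_batch is a hypothetical callable that runs one training step
    with the given batch size and raises RuntimeError when it does not fit.
    """
    size = start
    best = None
    while size <= limit:
        try:
            try_batch(size)
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # unrelated error: do not hide it
            break
        best = size  # this size worked, remember it
        size *= 2
    if best is None:
        raise RuntimeError(f"even batch size {start} does not fit")
    return best


def fake_step(size, capacity=40):
    """Simulated training step: fails above a made-up memory capacity."""
    if size > capacity:
        raise RuntimeError("CUDA out of memory (simulated)")


print(find_max_batch_size(fake_step))
```

Because the search only doubles, the result is the largest power-of-two multiple of `start` that fits, not necessarily the true maximum. A finer search between the last working size and the first failing one would be a possible refinement.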

If you have code that does something similar in TensorFlow, Caffe, or another framework, please share!

Answer 2

How to adjust GPU batch size dynamically without halting training

There is a very similar question that uses a random sampler to do the job.
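The linked question is not reproduced here, so the following is only a guess at what a sampler-based approach could look like: an iterable that yields lists of indices and re-reads its batch size before every batch. `AdjustableBatchSampler` is a hypothetical name. An object like this could presumably be passed as the `batch_sampler` argument of `torch.utils.data.DataLoader`, which accepts an iterable of index batches.

```python
import random


class AdjustableBatchSampler:
    """Yields lists of indices; batch_size is re-read before every batch."""

    def __init__(self, num_samples, batch_size, shuffle=True, seed=0):
        self.num_samples = num_samples
        self.batch_size = batch_size  # may be changed while iterating
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def __iter__(self):
        indices = list(range(self.num_samples))
        if self.shuffle:
            self.rng.shuffle(indices)
        pos = 0
        while pos < len(indices):
            size = max(1, int(self.batch_size))
            yield indices[pos:pos + size]
            pos += size


sampler = AdjustableBatchSampler(10, batch_size=2, shuffle=False)
batches = []
for batch in sampler:
    batches.append(batch)
    sampler.batch_size = 4  # takes effect from the next batch
print(batches)
```

With worker processes or prefetching in the data loader, a change in batch size would likely take effect with some delay, so this is easiest to reason about with `num_workers=0`.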

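I just want to add another option: DataLoader has a collate_fn argument that you can use to change the batch size.

collate_fn (callable, optional) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.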
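One way this could be used, sketched under assumptions: let the loader deliver the full training batch and have the collate function split it into GPU-sized chunks, so the training loop accumulates gradients over the chunks and updates once per loader batch. `ChunkingCollate` and `simple_collate` are hypothetical names, and `simple_collate` is a plain-Python stand-in for whatever real collate function would stack the samples into tensors.

```python
def simple_collate(samples):
    """Stand-in collate: turns [(x, y), ...] into ([x, ...], [y, ...])."""
    xs, ys = zip(*samples)
    return list(xs), list(ys)


class ChunkingCollate:
    """Splits one loader batch into chunks of at most chunk_size samples."""

    def __init__(self, chunk_size, inner=simple_collate):
        self.chunk_size = chunk_size  # may be changed between batches
        self.inner = inner            # collates each chunk separately

    def __call__(self, samples):
        size = max(1, int(self.chunk_size))
        return [self.inner(samples[i:i + size])
                for i in range(0, len(samples), size)]


collate = ChunkingCollate(chunk_size=2)
samples = [(i, i % 2) for i in range(5)]
print(collate(samples))
```

An instance would be passed as `collate_fn=collate` when building the DataLoader. Changing `chunk_size` from the training loop only works if the collate function runs in the main process, so `num_workers=0` is assumed here.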

How to adjust the GPU batch size during training?