Question

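I'm new to PyTorch, and I got confused while running the official torch.distributed example, at PyTorch ImageNet main.py L304.

I made a few small changes to the evaluation part of the source code, as shown below: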

    model.eval()
    with torch.no_grad():
        end = time.time()
        for i, (images, target, image_ids) in enumerate(val_loader):
            if args.gpu is not None:
                images = images.cuda(args.gpu, non_blocking=True)

            target = target.cuda(args.gpu, non_blocking=True)
            image_ids = image_ids.data.cpu().numpy()
            output = model(images)
            loss = criterion(output, target)

            # Get acc1, acc5 and update
            acc1, acc5 = accuracy(output, target, topk=(1, 5))
            losses.update(loss.item(), images.size(0))
            top1.update(acc1[0], images.size(0))
            top5.update(acc5[0], images.size(0))

            # print for the first batch (i == 0) only
            dist.barrier()
            if i==0:
                if args.gpu==0:
                    print("gpu 0",acc1,output.shape)
                if args.gpu==1:
                    print("gpu 1",acc1,output.shape)
                if args.gpu==2:
                    print("gpu 2",acc1,output.shape)
                if args.gpu==3:
                    print("gpu 3",acc1,output.shape)

The code above produces the following output:

Use GPU: 0 for training
Use GPU: 1 for training
Use GPU: 3 for training
Use GPU: 2 for training
=> loading checkpoint 'model_best.pth.tar'
...
gpu 3 tensor([75.], device='cuda:3') torch.Size([32, 200])
gpu 2 tensor([75.], device='cuda:2') torch.Size([32, 200])
gpu 1 tensor([75.], device='cuda:1') torch.Size([32, 200])
gpu 0 tensor([75.], device='cuda:0') torch.Size([32, 200])

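Since I am using 4 GPUs with a batch size of 128, I assume the 128 images are split and fed to the 4 GPUs separately. That is why all four GPUs have output.shape[0] = 32 (200 is num_classes).

What really confuses me is that all four GPUs show the same acc1. As I understand it, since each of the 4 GPUs takes a different part of the input (32 images each), they should produce different outputs and accuracies corresponding to their own inputs. In my print test, however, these GPUs show the same output and accuracy. I don't know why. Shouldn't they differ?

Any help is appreciated. Thanks in advance!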

Answer 1

Well, I think the explanation can be found in a GitHub issue on the official PyTorch ImageNet example code.
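One likely reading of the identical results, offered as an assumption and not as a summary of that issue: if the validation loader is built without a per-rank sampler, every process iterates over the same data in the same order, so the first batch of 32 is the same on every GPU. The plain-Python sketch below illustrates the difference. The 256-item `dataset` and both helper functions are hypothetical, and the strided split only imitates what `torch.utils.data.distributed.DistributedSampler` does when shuffling is off.

```python
# Hypothetical stand-in for a validation set: 256 image ids.
dataset = list(range(256))

world_size = 4                      # one process per GPU, as in the question
per_gpu_batch = 128 // world_size   # total batch 128 split across 4 GPUs -> 32


def first_batch_unsharded(rank):
    """No per-rank sampler: every rank reads the same items in the same order."""
    return dataset[:per_gpu_batch]


def first_batch_sharded(rank):
    """Strided split by rank (assumption: mimics DistributedSampler, shuffle off)."""
    shard = dataset[rank::world_size]
    return shard[:per_gpu_batch]


for rank in range(world_size):
    same = first_batch_unsharded(rank)[:4]
    split = first_batch_sharded(rank)[:4]
    print(f"rank {rank}: unsharded {same}  sharded {split}")
```

In the unsharded case all four ranks print the same ids, which would match seeing the same acc1 on every GPU. In the sharded case each rank gets a disjoint slice, so per-GPU accuracies would generally differ and would have to be combined across processes (for example with an all-reduce) to get a single number.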

(Distributed) Why do all GPUs give the same output?