Why does performance drop drastically after replacing DP with DDP?

Time: 2020-06-14 13:58:07

标签: python deep-learning computer-vision pytorch distributed

When I replaced DataParallel (DP) with DistributedDataParallel (DDP), the results on the validation set became much worse, as if the model were overfitting. I used 4 GPUs, one process per GPU, and kept the learning rate and batch size unchanged.
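
For reference, here is a minimal sketch of the DataParallel setup being replaced (the original DP code is not shown in this post, so this is an assumption about what it looked like). One semantic difference worth noting: under DP, args.batch_size is the global batch that gets scattered across the 4 GPUs, while under DDP with a DistributedSampler it is the per-process batch, so keeping it unchanged multiplies the effective batch size by 4.

import torch

# Hypothetical single-process DP setup (a sketch, not the actual original code).
model = models.__dict__[args.arch](network_data).cuda()
model = torch.nn.DataParallel(model)  # replicates the model across all visible GPUs

train_loader = torch.utils.data.DataLoader(
        train_set, batch_size=args.batch_size,  # global batch, split 4 ways by DP
        num_workers=args.workers, pin_memory=True, shuffle=True)

Here is all of the DDP-related code: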

import torch
import torch.backends.cudnn as cudnn
import torch.distributed as dist
from tqdm import tqdm

# One process per GPU; local_rank is supplied by the launcher.
dist.init_process_group(backend='nccl')
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)

# DistributedSampler gives each process a disjoint 1/world_size shard of the data.
train_sampler = torch.utils.data.distributed.DistributedSampler(train_set)
train_loader = torch.utils.data.DataLoader(
        train_set, batch_size=args.batch_size,
        num_workers=args.workers, sampler=train_sampler, pin_memory=True,
        shuffle=(train_sampler is None))  # a sampler and shuffle=True are mutually exclusive
val_sampler = torch.utils.data.distributed.DistributedSampler(val_set)
val_loader = torch.utils.data.DataLoader(
        val_set, batch_size=args.batch_size,
        num_workers=args.workers, pin_memory=True, shuffle=False, sampler=val_sampler)

model = models.__dict__[args.arch](network_data).to(device)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
cudnn.benchmark = True

nb_gpus = dist.get_world_size()  # number of participating processes (4 here)
for epoch in tqdm(range(args.start_epoch, args.epochs)):
    # train for one epoch; set_epoch changes the sampler's shuffling seed each epoch
    train_sampler.set_epoch(epoch)

    train_loss = train(......)
    # sum the per-process losses onto rank 0 (the result is only valid there)
    dist.reduce(train_loss, 0, op=dist.ReduceOp.SUM)
    print(train_loss / nb_gpus)

    test_loss = validate(.....)
    dist.reduce(test_loss, 0, op=dist.ReduceOp.SUM)
    print(test_loss / nb_gpus)
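
A note on the reduction above: dist.reduce only delivers the summed tensor to the destination rank (rank 0), so the values printed on ranks 1-3 are not the global mean. Below is a minimal sketch of one way to obtain a consistent average on every rank, using all_reduce (this helper is not part of the original code):

import torch.distributed as dist

def global_mean(loss_tensor):
    # Sum the scalar loss across all DDP processes, then divide by the
    # world size. Unlike dist.reduce, all_reduce leaves the same result
    # on every rank, so the value is safe to print anywhere.
    dist.all_reduce(loss_tensor, op=dist.ReduceOp.SUM)
    return loss_tensor / dist.get_world_size()

With such a helper, print(global_mean(test_loss)) shows the same value on every rank; alternatively, the prints can be guarded with if dist.get_rank() == 0:. For completeness, a script of this shape is typically launched with python -m torch.distributed.launch --nproc_per_node=4 <script>.py, which is what populates args.local_rank.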

[Plot omitted: the blue curve shows the validation-set results]

[Plot omitted: the brown curve shows the training-set results]

0 Answers

No answers yet.