Why does performance drop drastically after replacing DP with DDP?

Time: 2020-06-14 13:58:07

标签: python deep-learning computer-vision pytorch distributed

When I replaced DataParallel (DP) with DistributedDataParallel (DDP), the results on the validation set became much worse, as if the model were overfitting. I used 4 GPUs, one process per GPU, and kept the learning rate and batch size unchanged.
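
For reference, here is a minimal sketch of the DataParallel setup being replaced (the original DP code is not shown in this post, so this is an assumption about what it looked like). One semantic difference worth noting: under DP, args.batch_size is the global batch that gets scattered across the 4 GPUs, while under DDP with a DistributedSampler it is the per-process batch, so keeping it unchanged multiplies the effective batch size by 4.

import torch

# Hypothetical single-process DP setup (a sketch, not the actual original code).
model = models.__dict__[args.arch](network_data).cuda()
model = torch.nn.DataParallel(model)  # replicates the model across all visible GPUs

train_loader = torch.utils.data.DataLoader(
        train_set, batch_size=args.batch_size,  # global batch, split 4 ways by DP
        num_workers=args.workers, pin_memory=True, shuffle=True)

Here is all of the DDP-related code: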

import torch
import torch.backends.cudnn as cudnn
import torch.distributed as dist
from tqdm import tqdm

# One process per GPU; local_rank is supplied by the launcher.
dist.init_process_group(backend='nccl')
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)

# DistributedSampler gives each process a disjoint 1/world_size shard of the data.
train_sampler = torch.utils.data.distributed.DistributedSampler(train_set)
train_loader = torch.utils.data.DataLoader(
        train_set, batch_size=args.batch_size,
        num_workers=args.workers, sampler=train_sampler, pin_memory=True,
        shuffle=(train_sampler is None))  # a sampler and shuffle=True are mutually exclusive
val_sampler = torch.utils.data.distributed.DistributedSampler(val_set)
val_loader = torch.utils.data.DataLoader(
        val_set, batch_size=args.batch_size,
        num_workers=args.workers, pin_memory=True, shuffle=False, sampler=val_sampler)

model = models.__dict__[args.arch](network_data).to(device)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
cudnn.benchmark = True

nb_gpus = dist.get_world_size()  # number of participating processes (4 here)
for epoch in tqdm(range(args.start_epoch, args.epochs)):
    # train for one epoch; set_epoch changes the sampler's shuffling seed each epoch
    train_sampler.set_epoch(epoch)

    train_loss = train(......)
    # sum the per-process losses onto rank 0 (the result is only valid there)
    dist.reduce(train_loss, 0, op=dist.ReduceOp.SUM)
    print(train_loss / nb_gpus)

    test_loss = validate(.....)
    dist.reduce(test_loss, 0, op=dist.ReduceOp.SUM)
    print(test_loss / nb_gpus)
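
A note on the reduction above: dist.reduce only delivers the summed tensor to the destination rank (rank 0), so the values printed on ranks 1-3 are not the global mean. Below is a minimal sketch of one way to obtain a consistent average on every rank, using all_reduce (this helper is not part of the original code):

import torch.distributed as dist

def global_mean(loss_tensor):
    # Sum the scalar loss across all DDP processes, then divide by the
    # world size. Unlike dist.reduce, all_reduce leaves the same result
    # on every rank, so the value is safe to print anywhere.
    dist.all_reduce(loss_tensor, op=dist.ReduceOp.SUM)
    return loss_tensor / dist.get_world_size()

With such a helper, print(global_mean(test_loss)) shows the same value on every rank; alternatively, the prints can be guarded with if dist.get_rank() == 0:. For completeness, a script of this shape is typically launched with python -m torch.distributed.launch --nproc_per_node=4 <script>.py, which is what populates args.local_rank.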

[Plot omitted: the blue curve shows the validation-set results]

[Plot omitted: the brown curve shows the training-set results]

0 Answers

No answers yet.