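When I replaced DataParallel with DistributedDataParallel, the results on the validation set became much worse, as if the model were overfitting. I am using 4 GPUs with one process per GPU, and I kept the learning rate and batch size unchanged. Here is all of the DDP-related code: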
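import torch
import torch.backends.cudnn as cudnn
import torch.distributed as dist
from tqdm import tqdm

dist.init_process_group(backend='nccl')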
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)
train_sampler = torch.utils.data.distributed.DistributedSampler(train_set)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=args.batch_size,
    num_workers=args.workers, sampler=train_sampler, pin_memory=True,
    shuffle=(train_sampler is None))
val_sampler = torch.utils.data.distributed.DistributedSampler(val_set)
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=args.batch_size,
    num_workers=args.workers, pin_memory=True, shuffle=False,
    sampler=val_sampler)
model = models.__dict__[args.arch](network_data).to(device)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
cudnn.benchmark = True
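for epoch in tqdm(range(args.start_epoch, args.epochs)):
    # train for one epoch
    train_sampler.set_epoch(epoch)  # reshuffle the per-process shards each epoch
    train_loss = train(......)
    # sum the per-process losses onto rank 0; only rank 0 holds the reduced value
    dist.reduce(train_loss, 0, op=dist.ReduceOp.SUM)
    print(train_loss / nb_gpus)
    # evaluate on the validation set
    test_loss = validate(.....)
    dist.reduce(test_loss, 0, op=dist.ReduceOp.SUM)
    print(test_loss / nb_gpus)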
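The loop above reports the mean of the per-process mean losses. That matches the true per-sample mean only when every process averages over the same number of samples. As a minimal sketch of the arithmetic (plain Python, no GPUs needed), the hypothetical helpers below compare the two ways of combining per-rank results. The names `global_mean_loss` and `mean_of_rank_means` and the example numbers are assumptions for illustration, not part of the original code; in a real run the per-rank sums and counts would first have to be collected, for example with `dist.all_reduce`.

```python
from typing import List, Tuple

# Each entry is (sum of per-sample losses on that rank, number of samples on that rank).
RankStats = List[Tuple[float, int]]


def global_mean_loss(per_rank: RankStats) -> float:
    """Sample-weighted mean: total loss divided by total sample count."""
    total_loss = sum(loss_sum for loss_sum, _ in per_rank)
    total_count = sum(count for _, count in per_rank)
    if total_count == 0:
        raise ValueError("no samples were reported by any rank")
    return total_loss / total_count


def mean_of_rank_means(per_rank: RankStats) -> float:
    """What summing per-rank mean losses and dividing by the number of GPUs computes."""
    means = [loss_sum / count for loss_sum, count in per_rank]
    return sum(means) / len(means)


# Hypothetical numbers: three ranks saw 10 samples each, the fourth saw only 2.
stats = [(10.0, 10), (10.0, 10), (10.0, 10), (8.0, 2)]
print(global_mean_loss(stats))    # weighted by sample count
print(mean_of_rank_means(stats))  # every rank weighted equally
```

When all ranks process equally sized shards, both functions agree, so a gap between them points at uneven shard or batch sizes rather than at the model.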
The blue curve is the result on the validation set.
The brown curve is the result on the training set.
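One thing that may be worth checking when the learning rate and `batch_size` are kept unchanged: DataParallel splits each batch from a single DataLoader across the GPUs, whereas in the setup above every DDP process has its own DataLoader and draws `batch_size` samples per step. The short sketch below only illustrates that arithmetic. The helper names and the value 32 are hypothetical, and whether this explains the curves above is an assumption, not a diagnosis.

```python
def global_batch_size(loader_batch_size: int, world_size: int, distributed: bool) -> int:
    """Samples contributing to one optimizer step.

    distributed=False models DataParallel: one loader, batch split across GPUs.
    distributed=True models DDP with one loader per process.
    """
    if loader_batch_size <= 0 or world_size <= 0:
        raise ValueError("batch size and world size must be positive")
    return loader_batch_size * world_size if distributed else loader_batch_size


def per_process_batch_size(target_global_batch: int, world_size: int) -> int:
    """Per-process batch size that keeps the global batch at target_global_batch."""
    if target_global_batch % world_size != 0:
        raise ValueError("global batch must be divisible by the number of processes")
    return target_global_batch // world_size


# Hypothetical setting: batch_size=32 on 4 GPUs.
print(global_batch_size(32, 4, distributed=False))  # DataParallel
print(global_batch_size(32, 4, distributed=True))   # DDP, one loader per process
print(per_process_batch_size(32, 4))                # per-process size that matches DataParallel
```

If the global batch does grow by the number of processes, either the per-process batch size or the learning rate may need adjusting to make the two runs comparable.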