So, with version 1.6 I tried to optimize some of my code to use mixed precision and see whether I could measure a speedup.
I tried both training loops (each using 99% of the GPU on an RTX 2060), but I could not measure any speedup, so I want to show them here to first check whether my implementation is correct. My implementation is based on the code samples from https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/
Train function without mixed precision:
def train(model, device, train_loader, criterion, optimizer, scheduler, epoch, iter_meter, writer):
    model.train()
    data_len = len(train_loader.dataset)
    train_start_time = time.time()
    for batch_idx, _data in enumerate(train_loader):
        spectrograms, labels, input_lengths, label_lengths = _data
        spectrograms, labels = spectrograms.to(device), labels.to(device)
        optimizer.zero_grad()
        output = model(spectrograms)  # (batch, time, n_class)
        output = F.log_softmax(output, dim=2)
        output = output.transpose(0, 1)  # (time, batch, n_class)
        loss = criterion(output, labels, input_lengths, label_lengths)
        loss.backward()
        writer.add_scalar("Loss/train", loss.item(), iter_meter.get())
        writer.add_scalar("learning_rate", scheduler.get_last_lr()[0], iter_meter.get())
        optimizer.step()
        scheduler.step()
        iter_meter.step()
    return loss.item()
Train function with mixed precision:
def train(model, device, train_loader, criterion, optimizer, scheduler, epoch, iter_meter, scaler, writer):
    model.train()
    data_len = len(train_loader.dataset)
    train_start_time = time.time()
    for batch_idx, _data in enumerate(train_loader):
        spectrograms, labels, input_lengths, label_lengths = _data
        spectrograms, labels = spectrograms.to(device), labels.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            output = model(spectrograms)  # (batch, time, n_class)
            output = F.log_softmax(output, dim=2)
            output = output.transpose(0, 1)  # (time, batch, n_class)
            loss = criterion(output, labels, input_lengths, label_lengths)
        # Mixed precision
        scaler.scale(loss).backward()  # loss.backward()
        scaler.step(optimizer)  # optimizer.step()
        scheduler.step()  # Should I also wrap this step in the scaler?
        iter_meter.step()
        # Updates the scale for next iteration
        scaler.update()
        writer.add_scalar("Loss/train", loss.item(), iter_meter.get())
        writer.add_scalar("learning_rate", scheduler.get_last_lr()[0], iter_meter.get())
    return loss.item()
The main difference is that I pass the input through the model and compute the loss inside the with torch.cuda.amp.autocast(): block, and then run the backward pass through the scaler.
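For comparison, here is a minimal self-contained sketch of the loop ordering shown in the PyTorch AMP examples: scaler.update() directly after scaler.step(optimizer), and scheduler.step() outside the scaler (the LR scheduler is never wrapped by GradScaler). The tiny linear model and random data below are placeholders, not my actual model; enabled=use_amp makes autocast and GradScaler no-ops when no GPU is available.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # autocast/GradScaler act as no-ops when disabled

# Placeholder model and optimizer setup (hypothetical sizes)
model = nn.Linear(16, 4).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for _ in range(3):  # stand-in for iterating over a DataLoader
    x = torch.randn(8, 16, device=device)
    y = torch.randn(8, 4, device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = criterion(model(x), y)  # forward pass under autocast
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads, then optimizer.step()
    scaler.update()                # adjust the scale for the next iteration
    scheduler.step()               # LR scheduler stays outside the scaler
```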
Training with both functions takes the same amount of time. My thoughts on possible reasons are:
I can put the actual model being trained on GitHub if anyone wants to check it.
Could some experienced eyes take a look?