I previously trained a VGG model (call it model1) and a two-layer model (call it model2) separately. Now I have to train a new model that combines the two, and I want to initialize each part of the new model with the learned weights of model1 and model2. I implemented it as follows:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable

class TransferModel(nn.Module):
    def __init__(self, VGG, TwoLayer):
        super(TransferModel, self).__init__()
        self.vgg_layer = VGG
        self.linear = TwoLayer
        for param in self.vgg_layer.parameters():
            param.requires_grad = True  # fine-tune the VGG part as well

    def forward(self, x):
        h1_vgg = self.vgg_layer(x)
        y_pred = self.linear(h1_vgg)
        return y_pred
new_model = TransferModel(trained_vgg_instance, trained_twolayer_instance)
# initialize both parts from the previously learned weights
new_model.linear.load_state_dict(trained_twolayer_instance.state_dict())
new_model.vgg_layer.load_state_dict(trained_vgg_instance.state_dict())
new_model.cuda()
For training, I try:
def train(model, learning_rate=0.001, batch_size=50, epochs=2):
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    criterion = torch.nn.MultiLabelSoftMarginLoss()
    x = torch.zeros([batch_size, 3, img_size, img_size])
    y_true = torch.zeros([batch_size, 4096])

    for epoch in range(epochs):  # loop over the dataset multiple times
        running_loss = 0.0
        shuffled_indcs = torch.randperm(20000)
        for batch_num in range(int(20000 / batch_size)):
            optimizer.zero_grad()
            for j in range(batch_size):
                # ... some code to load batches of images into x ...
            x_batch = Variable(x).cuda()
            print(batch_num)
            y_true_batch = Variable(train_labels[batch_num * batch_size:(batch_num + 1) * batch_size, :]).cuda()
            y_pred = model(x_batch)
            loss = criterion(y_pred, y_true_batch)
            loss.backward()
            optimizer.step()
            running_loss += loss
            del x_batch, y_true_batch, y_pred
            torch.cuda.empty_cache()
        print("in epoch[%d] = %.8f " % (epoch, running_loss / (batch_num + 1)))
        running_loss = 0.0
    print('Finished Training')

train(new_model)
train(new_model)
In the second iteration (batch_num = 1) of the first epoch, I get this error:

CUDA out of memory. Tried to allocate 153.12 MiB (GPU 0; 5.93 GiB total capacity; 4.83 GiB already allocated; 66.94 MiB free; 374.12 MiB cached)
Although I explicitly use 'del' in training, running nvidia-smi shows that it seems to do nothing and the memory is not freed.

What should I do?
Answer 0 (score: 0):
Change this line:

running_loss += loss

to this:

running_loss += loss.item()
By adding loss to running_loss, you are telling PyTorch to keep all the gradients with respect to loss for that batch in memory, even after you start training the next batch. PyTorch assumes that you may later want to use running_loss in some larger loss function spanning multiple batches, and therefore keeps all the gradients (and hence the activations) of all batches in memory.

By adding .item() you get the loss as a Python float rather than a torch.FloatTensor. A float object is detached from the PyTorch graph, so PyTorch knows you do not need gradients with respect to it.
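A minimal sketch of the difference (the toy tensor here is illustrative, not taken from the code above):

import torch

# A toy computation that produces a loss tensor with an autograd graph attached.
w = torch.randn(3, requires_grad=True)
loss = (w * w).sum()

kept = loss          # still a tensor: kept.grad_fn references the whole graph
print(kept.grad_fn)  # prints something like <SumBackward0 object at ...>

freed = loss.item()  # a plain Python float, detached from the graph
print(type(freed))   # <class 'float'> -- nothing here for autograd to hold on to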
If you are running an older version of PyTorch that does not have .item(), you can try:

running_loss += float(loss.detach().cpu())

(On very old versions, where loss is a Variable, the equivalent idiom was loss.data[0].)
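As a side note on the nvidia-smi observation: PyTorch's caching allocator holds on to freed GPU memory instead of returning it to the driver, so nvidia-smi will keep reporting it as in use even after del. To see what PyTorch itself considers live, you can query the allocator directly (a sketch using the standard torch.cuda calls; memory_reserved was named memory_cached in older releases):

import torch

print(torch.cuda.memory_allocated() / 1024**2, "MiB held by live tensors")
print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved by the caching allocator")
torch.cuda.empty_cache()  # releases only *unused* cached blocks back to the driver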
This can also be caused by a similar mistake in your test() loop, if you have one.
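For reference, a minimal evaluation loop that avoids the same pitfall could look like this (model, loader, and criterion are placeholders, not names from your code):

import torch

def test(model, loader, criterion):
    model.eval()
    total_loss = 0.0
    with torch.no_grad():  # no graph is built, so activations are not retained
        for x_batch, y_batch in loader:
            x_batch, y_batch = x_batch.cuda(), y_batch.cuda()
            y_pred = model(x_batch)
            total_loss += criterion(y_pred, y_batch).item()  # plain float, no graph
    return total_loss / len(loader)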