I built a model in PyTorch. The code runs fine on CPU, but after initializing the network and moving it to CUDA, it seems the GPU runs out of memory.
The code that fails is:
model = init_from_scratch(args, train_exs, dev_exs)
model.init_optimizer()
nParams = sum(np.prod(list(p.size())) for p in model.network.parameters())
print('* total number of parameters:', nParams)
device = torch.device("cuda:"+str(args.gpu) if args.cuda else "cpu")
model.set_device(device)  # call last!
# DATA ITERATORS (not run to here)
train_dataset = data.ReaderDataset()
train_sampler = torch.utils.data.sampler.RandomSampler()
train_loader = torch.utils.data.DataLoader()
...
# TRAIN (not run to here)
for epoch in range(start_epoch, args.num_epochs):
....
def set_device(self, device):
    print("device", device)
    self.use_cuda = str(device) != "cpu"
    self.network = self.network.to(device)
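For reference, here is a minimal standalone sketch I can use to test whether the device transfer alone exhausts memory (the `nn.Linear` size is just chosen so the parameter count is close to my network's ~30.7M; it is not my actual model):

```python
import torch
import torch.nn as nn

# A stand-in model with roughly the same parameter count as mine (~30.7M):
# 5540 * 5540 weights + 5540 biases ≈ 30.7M parameters.
model = nn.Linear(5540, 5540)
nParams = sum(p.numel() for p in model.parameters())
print('* total number of parameters:', nParams)

# Move to GPU if available, mirroring the set_device step that fails.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)
print(next(model.parameters()).device)
```

If this isolated transfer also fails, the problem is with the CUDA context or the GPU itself rather than with my model code.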
Total number of parameters: 30708906
The torch version is '0.4.0'.
The GPU is a TITAN Xp with 12 GB of memory, and nvidia-smi shows it as free.
So, what is wrong with my code?
The error output is:
Traceback (most recent call last):
File "script/train2.0.py", line 683, in <module>
main(args)
File "script/train2.0.py", line 542, in main
model.set_device(device)
File "./model.py", line 588, in set_device
self.network = self.network.to(device)
File "/home/username/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 393, in to
return self._apply(lambda t: t.to(device))
File "/home/username/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 176, in _apply
module._apply(fn)
File "/home/username/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 182, in _apply
param.data = fn(param.data)
File "/home/username/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 393, in <lambda>
return self._apply(lambda t: t.to(device))
File "/home/username/.local/lib/python3.5/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/THCTensorRandom.cu:25