I have 4 GPUs (0, 1, 2, 3), and I want to run one Jupyter notebook on GPU 2 and another on GPU 0. So, after executing
export CUDA_VISIBLE_DEVICES=0,1,2,3
for the notebook on GPU 2 I use
device = torch.device( f'cuda:{2}' if torch.cuda.is_available() else 'cpu')
device, torch.cuda.device_count(), torch.cuda.is_available(), torch.cuda.current_device(), torch.cuda.get_device_properties(1)
and, after creating a new model or loading an existing one,
model = nn.DataParallel( model, device_ids = [ 0, 1, 2, 3])
model = model.to( device)
Then, when I start training the model, I get
RuntimeError Traceback (most recent call last)
<ipython-input-18-849ffcb53e16> in <module>
46 with torch.set_grad_enabled( phase == 'train'):
47 # [N, Nclass, H, W]
---> 48 prediction = model(X)
49 # print( prediction.shape, y.shape)
50 loss_matrix = criterion( prediction, y)
~/.local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
491 result = self._slow_forward(*input, **kwargs)
492 else:
--> 493 result = self.forward(*input, **kwargs)
494 for hook in self._forward_hooks.values():
495 hook_result = hook(self, input, result)
~/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
144 raise RuntimeError("module must have its parameters and buffers "
145 "on device {} (device_ids[0]) but found one of "
--> 146 "them on device: {}".format(self.src_device_obj, t.device))
147
148 inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:2
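Answer 0 (score: 2)
DataParallel requires every input tensor to be on the first device in its device_ids list. As the error message states, the module's parameters and buffers must be on that same device (device_ids[0]).
It essentially uses that device as a staging area before scattering to the other GPUs, and it is also the device where the final outputs are gathered before forward returns. If you want device 2 to be the primary device, just put it at the front of the list, as follows: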
model = nn.DataParallel(model, device_ids = [2, 0, 1, 3])
model.to(f'cuda:{model.device_ids[0]}')
After that, every tensor passed to the model should also be on that first device.
x = ... # input tensor
x = x.to(f'cuda:{model.device_ids[0]}')
y = model(x)
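The reordering described above can be wrapped in a small helper. This is only a sketch: `primary_first` and `primary_device_name` are hypothetical names introduced here for illustration, and the `nn.Linear(8, 2)` model is a stand-in for the real network. The DataParallel part runs only if torch is installed and enough GPUs are visible.

```python
def primary_first(device_ids, primary):
    """Return a copy of device_ids with `primary` moved to the front (hypothetical helper)."""
    if primary not in device_ids:
        raise ValueError(f"device {primary} is not in {device_ids}")
    return [primary] + [d for d in device_ids if d != primary]


def primary_device_name(device_ids):
    """Name of the device DataParallel treats as primary: the first id in the list."""
    return f"cuda:{device_ids[0]}"


ids = primary_first([0, 1, 2, 3], 2)
name = primary_device_name(ids)

try:
    import torch
    from torch import nn
except ImportError:
    torch = None  # the helpers above still work without torch

# Only attempt the multi-GPU part when every requested GPU actually exists.
if torch is not None and torch.cuda.is_available() and torch.cuda.device_count() > max(ids):
    model = nn.DataParallel(nn.Linear(8, 2), device_ids=ids)  # placeholder model
    model.to(name)                    # parameters and buffers go to device_ids[0]
    x = torch.randn(16, 8).to(name)   # inputs go to the same device
    y = model(x)
    print(y.device)
```

Keeping the device name derived from the list means the model, the inputs, and `device_ids[0]` cannot drift apart when the primary GPU is changed later.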
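Answer 1 (score: 0)
For me, the following also works. With no device_ids argument, DataParallel uses all visible GPUs, and a bare "cuda" device refers to the current CUDA device (cuda:0 by default), so the model and the inputs end up on the same device: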
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    network = nn.DataParallel(network)
network.to(device)
tnsr = tnsr.to(device)
Answer 2 (score: 0)
This error occurs in torch when the model and the data are not both on cuda.
Try code like this to move the model and the data to cuda:
model = model.to('cuda')
images = images.to('cuda')
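To check where a model actually lives before calling it, something like the following may help. It is a minimal sketch: `devices_in_use` is a hypothetical helper name, not part of PyTorch, and it relies only on the module exposing `parameters()` and `buffers()`, which `nn.Module` does. The `nn.Linear(8, 2)` model is a placeholder.

```python
from itertools import chain


def devices_in_use(module):
    """Return the set of device names holding the module's parameters and buffers."""
    return {str(t.device) for t in chain(module.parameters(), module.buffers())}


try:
    import torch
    from torch import nn
except ImportError:
    torch = None  # the helper itself does not need torch to be importable

if torch is not None:
    model = nn.Linear(8, 2)       # placeholder model, created on the CPU
    print(devices_in_use(model))  # lists every device the model's tensors sit on
```

If the returned set contains more than one device, or a device other than the one the inputs are on, that mismatch is the likely source of errors such as the one in the question.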