Question

When I use multiple GPUs and call .cuda() on tensors,
I get the following error during training:

RuntimeError: binary_op(): expected both inputs to be on same device, 
but input a is on cuda:0 and input b is on cuda:7

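I take this to mean that, after the following line, R2 is on cuda:7 (I am not sure whether R2 is actually on cuda:0 or cuda:7) while R1 is on cuda:0,
so the operation fails because the two tensors are on different GPUs.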

R2=torch.where(R2<1e-4,torch.Tensor([1e-4]).squeeze().cuda(),R2)
div_R1_R2=torch.div(R1,R2)
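
One thing that may be worth checking: .cuda() with no argument places the new tensor on the current CUDA device, which is not necessarily the device R2 lives on. Below is a minimal sketch, not a confirmed fix for this error, that builds the threshold value directly on R2's own device and dtype with torch.full_like. The helper name clamp_small and the sample values are made up for illustration, and the script falls back to the CPU when no GPU is available.

```python
import torch

def clamp_small(t, eps=1e-4):
    # Hypothetical helper: replace entries below eps with eps.
    # full_like creates the fill tensor on t's own device and dtype,
    # so torch.where never mixes tensors from two different GPUs.
    fill = torch.full_like(t, eps)
    return torch.where(t < eps, fill, t)

# Fall back to the CPU so the sketch also runs without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Illustrative values only; R1 and R2 stand in for the tensors above.
R1 = torch.tensor([1.0, 2.0, 3.0], device=device)
R2 = torch.tensor([0.0, 0.5, 2.0], device=device)

R2 = clamp_small(R2)
div_R1_R2 = torch.div(R1, R2)
print(R2.device, div_R1_R2)
```

The same idea applies to the other snippets in this question: derive the device of any helper tensor from the tensor it is combined with, instead of hard-coding it through .cuda().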

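I get exactly the same error in the following code.
It complains that the operation on O_img_tc and R_gt_img_tc cannot be computed
because they are on different GPUs.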

R_gt_img_tc=torch.where(
  torch.abs(R_gt_img_tc)<1e-4,
  torch.Tensor([1e-4]).squeeze().cuda(),R_gt_img_tc)
sha=torch.clamp(torch.div(O_img_tc,R_gt_img_tc),0.0,1.3)[:,0,:,:].unsqueeze(1)

How can I fix this? What am I doing wrong?

What I have tried:
- Using horovod: the same error occurs.
- Checking the GPU index with dense_O_img_tc.get_device() and dense_S_gt_img_tc.get_device():
dense_O_img_tc.get_device() returns 0, while dense_S_gt_img_tc.get_device() returns 7.
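
The device check above can be wrapped in a small helper so a mismatch is reported with tensor names before the failing operation runs. This is only a diagnostic sketch: describe_devices and assert_same_device are hypothetical names, not PyTorch functions, and the CPU tensors below are placeholders for the real ones. It reads the .device attribute, which is also available on CPU tensors.

```python
import torch

def describe_devices(**named_tensors):
    """Return a {name: device string} mapping for the given tensors."""
    return {name: str(t.device) for name, t in named_tensors.items()}

def assert_same_device(**named_tensors):
    """Raise a readable error if the tensors do not all share one device."""
    devices = describe_devices(**named_tensors)
    if len(set(devices.values())) > 1:
        raise RuntimeError(f"tensors are on different devices: {devices}")

# Placeholder CPU tensors standing in for the real ones.
a = torch.zeros(2, 3)
b = torch.ones(2, 3)

print(describe_devices(dense_O_img_tc=a, dense_S_gt_img_tc=b))
assert_same_device(dense_O_img_tc=a, dense_S_gt_img_tc=b)
```

Calling assert_same_device right before torch.div or torch.where shows which tensor ended up on the unexpected GPU.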

I then tried:

same_cuda=torch.device('cuda:'+str(dense_O_img_tc.get_device()))
dense_S_gt_img_tc=torch.where(
  torch.abs(dense_S_gt_img_tc)<1e-4,
  # Note here that I'm using cuda(same_cuda)
  torch.Tensor([1e-4]).squeeze().cuda(same_cuda),dense_S_gt_img_tc)

ref=torch.div(dense_O_img_tc,dense_S_gt_img_tc)

This does get rid of the "different GPUs" error,
but then only GPU 0 is used, so GPU 0 fills up and I get an error for that instead.

So I tried moving dense_O_img_tc, which is on GPU 0, to GPU 7 (or to whichever device dense_S_gt_img_tc is on, obtained with dense_S_gt_img_tc.get_device()), but then I get the following illegal memory access error:

RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/generic/THCTensorMath.cu:238

PyTorch: multi-GPU error: RuntimeError: binary_op(): expected both inputs to be on same device, but input a is on cuda:0 and input b is on cuda:7

0 answers: