How to fix "RuntimeError: CUDA error: device-side assert triggered" in PyTorch

Asked: 2019-10-04 19:48:40

Tags: gpu pytorch yolo

I am trying to train a YOLOv3 model from this repo https://github.com/eriklindernoren/PyTorch-YOLOv3 on my custom dataset of shapes, but I keep getting the error "RuntimeError: CUDA error: device-side assert triggered".

I looked for a solution and tried several things suggested in different answers (for example, fixing the class indices in the annotations), but the error still persists.

I am following the instructions in the repo's README for training on a custom dataset, and I adjusted custom.data and data/custom/ accordingly.
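
For reference, a minimal sketch of the kind of annotation check suggested in other answers (the NUM_CLASSES value and label directory are assumptions, not values from this post); it assumes the repo's darknet-style label files with one "class x_center y_center width height" line per object:

import glob

NUM_CLASSES = 1                      # assumption: set this to the class count in custom.data
LABEL_DIR = "data/custom/labels"     # assumption: adjust to your actual layout

for path in glob.glob(LABEL_DIR + "/*.txt"):
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            cls = int(float(line.split()[0]))
            if not (0 <= cls < NUM_CLASSES):
                print("%s:%d: class index %d is outside [0, %d)" % (path, line_no, cls, NUM_CLASSES))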

I keep getting this output:

C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [32,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [33,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [34,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [35,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [36,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [37,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [38,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [39,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [40,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [41,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [42,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [43,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [44,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [45,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [2,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [3,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [4,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [5,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [6,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [7,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [12,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [13,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [14,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [15,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [20,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [21,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [22,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [23,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [24,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [25,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [26,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [27,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [28,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [29,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [31,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "train.py", line 105, in <module>
    loss, outputs = model(imgs, targets)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\Documents\GP\Code\TorchYolo\PyTorch-YOLOv3\models.py", line 259, in forward
    x, layer_loss = module[0](x, targets, img_dim)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\Documents\GP\Code\TorchYolo\PyTorch-YOLOv3\models.py", line 188, in forward
    ignore_thres=self.ignore_thres,
  File "D:\Documents\GP\Code\TorchYolo\PyTorch-YOLOv3\utils\utils.py", line 318, in build_targets
    iou_scores[b, best_n, gj, gi] = bbox_iou(pred_boxes[b, best_n, gj, gi], target_boxes, x1y1x2y2=False)
  File "D:\Documents\GP\Code\TorchYolo\PyTorch-YOLOv3\utils\utils.py", line 199, in bbox_iou
    b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2
RuntimeError: CUDA error: device-side assert triggered

The only thing that changes is the value '2' in the array indexing, while I was tinkering with the class label index for train.jpg, at the line:

b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2

4 Answers:

Answer 0 (score: 4)

I also ran into this problem while training a ResNet model on Google Colab. In my case, I was training the model on 7 classes, but the last layer of my network was set to output 3 classes.

So I changed

 it("mouse down event", async () => {
    const mockCallBack = jest.fn()
    props.output.selection = "CLICK"
    props.canvas.allPrimitives = true
    const wrapper = mount(
      <ReAlignedImageCanvas onMouseDown={mockCallBack} {...props} />
    )

    await wrapper.find(".showOutput").invoke("onMouseDown")(
      {
        nativeEvent: {
          offsetX: 200,
          offsetY: 180
        }
      },
      9000
    )
    expect(wrapper.state("dragStartX")).toBe(200)
    expect(wrapper.state("dragStartY")).toBe(180)
  })

to this

self.classifier = torch.nn.Sequential(torch.nn.BatchNorm1d(512), torch.nn.Linear(512, 7))

Even after doing this I still got the error, because I had not restarted Google Colab.

Remember, whenever you run into this error, check two things:

  1. class_labels should start from 0, i.e. [0,1,2,3,4,5,6] in my case for 7 classes.
  2. Check that the final output layer outputs the exact number of classes (see the sketch at the end of this answer).

Then,

refresh the notebook to clear all the CUDA asserts.

After any CUDA error occurs, restart the notebook; otherwise you will keep getting CUDA errors, because the earlier asserts have not been cleared. Restarting the notebook clears all the CUDA asserts.
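
Referring back to the two checks above, here is a minimal sketch of such a sanity check (the variable names are illustrative, not from this answer): confirm that the classifier head outputs the right number of classes and that every label index is in range before anything reaches the GPU.

import torch

num_classes = 7
classifier = torch.nn.Sequential(torch.nn.BatchNorm1d(512), torch.nn.Linear(512, num_classes))

labels = torch.tensor([0, 3, 6, 2])          # hypothetical batch of class labels
out_features = classifier[-1].out_features   # 7 for the layer above
assert labels.min().item() >= 0 and labels.max().item() < out_features, \
    "label index out of range for a %d-way output" % out_features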

Answer 1 (score: 2)

In general, when you get a cryptic CUDA error, you should switch to the CPU and see whether you get a more meaningful error message there.

Alternatively, set CUDA_LAUNCH_BLOCKING=1 to obtain a more informative stack trace (see this answer for details).

See this answer for more details.
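
For reference, a minimal self-contained sketch of both debugging routes (the toy tensor below is illustrative, not from the question):

import torch

# Route 1: reproduce on the CPU, where an out-of-range index raises a readable
# Python exception instead of an asynchronous device-side assert.
x = torch.zeros(3)
idx = torch.tensor([5])                # deliberately out of bounds
try:
    x[idx]
except (IndexError, RuntimeError) as e:
    print(e)                           # "index 5 is out of bounds for dimension 0 with size 3"

# Route 2: make CUDA kernel launches synchronous so the stack trace points at the
# op that actually failed. The variable must be set before CUDA is initialized,
# e.g. from the shell:
#   CUDA_LAUNCH_BLOCKING=1 python train.py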


As a guess, in your case it looks like the division by 2 creates fractional values where PyTorch is looking for integers. Try

b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] // 2, box1[:, 0] + box1[:, 2] // 2

Answer 2 (score: 0)

I ran into the same error when putting the model on two 2080 Ti GPUs.


It had already run many times, and none of my modifications to the code caused the error. nvidia-smi reported two healthy devices. Rebooting the machine resolved the problem. This is just to show that it can actually be caused by some obscure device issue unrelated to your code.

Answer 3 (score: 0)

  1. In my case, I made the class_labels start from 0: [0,1,2,3,4] for 5 classes.
  2. Then I restarted the Google notebook and it worked.