RuntimeError: CUDA error: invalid argument

Time: 2019-09-26 13:28:11

Tags: pytorch

Training runs epoch 1 and its eval successfully, but fails at the start of epoch 2.

Train Epoch:1[655200/655800(100%)] loss:26.4959 lr:0.2050
Test Epoch:1 acc:0.973 val:0.895

Train Epoch:2[0/655800(0%)] loss:26.8068 lr:0.2051
File "train_11w.py", line 244, in main
    train(train_loader, model, optimizer, epoch, lr_decay_type, logger, args.log_interval, args)
  File "train_11w.py", line 305, in train
    prediction, ex, exnorm = model(img, mode=6, y=label)
  File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
RuntimeError: CUDA error: invalid argument
Driver Version: 418.67
CUDA Version 10.0.130
python 2.7.3
torch 1.0.0

2 Answers:

Answer 0 (score: 0)

Although it is hard to tell exactly what the problem is, I suggest trying the following.

  1. Can you try running the code as CUDA_LAUNCH_BLOCKING=1 python script_name args? The CUDA_LAUNCH_BLOCKING=1 environment variable forces all CUDA operations to run synchronously, so the error message should point to the correct line of code in the stack trace.
  2. Try setting torch.backends.cudnn.benchmark to True/False and check whether that helps.
  3. Train the model without DataParallel.
  4. Check whether training works if you set drop_last=True when creating the DataLoader. (A minimal sketch of these suggestions follows this list.)
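For reference, here is a minimal sketch of how these suggestions could be applied. MyModel, train_dataset and the batch size are placeholders, not taken from the original train_11w.py:

import os
# Suggestion 1 (alternative to the command-line form): the variable must be set
# before the CUDA context is created, i.e. before the first CUDA call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
from torch.utils.data import DataLoader

# Suggestion 2: toggle cuDNN benchmark mode and re-run with each setting.
torch.backends.cudnn.benchmark = False  # also try True

# Suggestion 4: drop the last incomplete batch so every batch has the same size.
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True,
                          num_workers=4, drop_last=True)

# Suggestion 3: run on a single GPU instead of wrapping the model in DataParallel.
model = MyModel().cuda()  # instead of torch.nn.DataParallel(MyModel()).cuda()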

Answer 1 (score: 0)

I ran it with CUDA_LAUNCH_BLOCKING=1 and it also failed:

Traceback (most recent call last):
  File "train_11w.py", line 691, in <module>
    main(args)
  File "train_11w.py", line 244, in main
    train(train_loader, model, optimizer, epoch, lr_decay_type, logger, args.log_interval, args)
  File "train_11w.py", line 307, in train
    prediction, ex, exnorm = model(img, mode=6, y=label)
  File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 142, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 147, in replicate
    return replicate(module, device_ids)
  File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/replicate.py", line 13, in replicate
    param_copies = Broadcast.apply(devices, *params)
  File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/_functions.py", line 21, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/cuda/comm.py", line 40, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 3: internal error