Training runs epoch 1 and the eval successfully, but fails at the start of epoch 2.
Train Epoch:1[655200/655800(100%)] loss:26.4959 lr:0.2050
Test Epoch:1 acc:0.973 val:0.895
Train Epoch:2[0/655800(0%)] loss:26.8068 lr:0.2051
File "train_11w.py", line 244, in main
train(train_loader, model, optimizer, epoch, lr_decay_type, logger, args.log_interval, args)
File "train_11w.py", line 305, in train
prediction, ex, exnorm = model(img, mode=6, y=label)
File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
RuntimeError: CUDA error: invalid argument
Driver Version: 418.67
CUDA Version 10.0.130
python 2.7.3
torch 1.0.0
Answer 0 (score: 0)
Although it is hard to tell what the problem is, I would suggest the following. Can you run your code with:
CUDA_LAUNCH_BLOCKING=1 python script_name args
The CUDA_LAUNCH_BLOCKING=1 env variable ensures that all CUDA operations are called synchronously, so the error message should point to the correct line of code in the stack trace. Also, try setting torch.backends.cudnn.benchmark to True and to False and check whether either setting works. Finally, if you set drop_last=True
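in your DataLoader, does training work? A minimal sketch of those two toggles follows; the dataset, batch size, and worker count are placeholders, not values from the question:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the real training set; shapes are illustrative.
train_dataset = TensorDataset(torch.randn(1000, 3, 32, 32),
                              torch.randint(0, 10, (1000,)))

# Toggle cuDNN autotuning. benchmark=True lets cuDNN pick kernels per input
# shape; flipping it between True and False isolates cuDNN algorithm issues.
torch.backends.cudnn.benchmark = False

# drop_last=True discards the final incomplete batch, so every batch has the
# same size. An odd-sized last batch is a common cause of errors that surface
# only at an epoch boundary.
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True,
                          num_workers=4, drop_last=True)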
Answer 1 (score: 0)
I ran it with CUDA_LAUNCH_BLOCKING=1 and it also failed. The traceback:

Traceback (most recent call last):
  File "train_11w.py", line 691, in <module>
    main(args)
  File "train_11w.py", line 244, in main
    train(train_loader, model, optimizer, epoch, lr_decay_type, logger, args.log_interval, args)
  File "train_11w.py", line 307, in train
    prediction, ex, exnorm = model(img, mode=6, y=label)
  File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 142, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 147, in replicate
    return replicate(module, device_ids)
  File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/replicate.py", line 13, in replicate
    param_copies = Broadcast.apply(devices, *params)
  File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/_functions.py", line 21, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/home/luban/anaconda2/lib/python2.7/site-packages/torch/cuda/comm.py", line 40, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 3: internal error
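Since this trace bottoms out in comm.broadcast_coalesced during DataParallel replication, one way to narrow things down is to exercise the same NCCL broadcast path outside the training loop. This is an illustrative sketch, not something posted in the thread; if it raises the same NCCL Error 3, the failure lies in multi-GPU communication (driver, NCCL, or GPU state) rather than in train_11w.py:

import torch
from torch.cuda import comm

# Broadcast a small tensor from GPU 0 to every visible GPU, reproducing the
# comm.broadcast_coalesced call at the bottom of the stack trace in isolation.
devices = list(range(torch.cuda.device_count()))
src = torch.randn(4, device='cuda:0')
copies = comm.broadcast_coalesced([src], devices)
print([tensors[0].device for tensors in copies])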