CUDNN_STATUS_MAPPING_ERROR when training pose2body

Date: 2018-11-09 09:03:57

Tags: deep-learning pytorch cudnn

I am trying to train https://github.com/NVIDIA/vid2vid. I am...

  • ...following the basic parameterization shown in the README, except that I had to change the number of GPUs and increased the number of threads reading the dataset. Command:

    python train.py \
        --name pose2body_256p \
        --dataroot datasets/pose \
        --dataset_mode pose \
        --input_nc 6 \
        --num_D 2 \
        --resize_or_crop ScaleHeight_and_scaledCrop \
        --loadSize 384 \
        --fineSize 256 \
        --gpu_ids 0,1 \
        --batchSize 1 \
        --max_frames_per_gpu 3 \
        --no_first_img \
        --n_frames_total 12 \
        --max_t_step 4 \
        --nThreads 6

  • ...training on the provided example dataset.

  • ...running a Docker container built with the script in vid2vid/docker, i.e. using CUDA 9.0 and cuDNN 7.
  • ...using two NVIDIA V100 GPUs.

Whenever I start training, the script crashes after a few minutes with the message RuntimeError: CUDNN_STATUS_MAPPING_ERROR. Full error message:

Traceback (most recent call last):
  File "train.py", line 329, in <module>
    train()
  File "train.py", line 104, in train
    fake_B, fake_B_raw, flow, weight, real_A, real_Bp, fake_B_last = modelG(input_A, input_B, inst_A, fake_B_last)            
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py", line 114, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/parallel_apply.py", line 41, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/vid2vid/models/vid2vid_model_G.py", line 130, in forward
    fake_B, fake_B_raw, flow, weight = self.generate_frame_train(netG, real_A_all, fake_B_prev, start_gpu, is_first_frame)        
  File "/vid2vid/models/vid2vid_model_G.py", line 175, in generate_frame_train
    fake_B_feat, flow_feat, fake_B_fg_feat, use_raw_only)
  File "/vid2vid/models/networks.py", line 171, in forward
    downsample = self.model_down_seg(input) + self.model_down_img(img_prev)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDNN_STATUS_MAPPING_ERROR

Judging from the vid2vid issues about training with two V100s, this setup should work. The same error also occurs when using CUDA 8 / cuDNN 6. I checked the flags but found no indication that the arguments passed to train.py need any further changes.
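As a diagnostic step (not part of the original question), it can help to confirm which CUDA/cuDNN build PyTorch actually sees inside the container, and which GPUs it detects; V100 cards report compute capability 7.0, which requires a CUDA 9 / cuDNN 7 build of PyTorch:

```python
# Print the CUDA/cuDNN stack PyTorch was built against and the visible GPUs.
# Run this inside the Docker container before launching train.py.
import torch

print("PyTorch:", torch.__version__)
print("CUDA (compiled against):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}, "
              f"compute capability {major}.{minor}")
else:
    print("No CUDA device visible to PyTorch")
```

If the reported cuDNN version or compute capability does not match what the Dockerfile promises, the mismatch itself may be the source of the mapping error.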

Any ideas on how to fix (or work around) this?

1 answer:

Answer 0: (score: 1)

In case anyone runs into the same problem: training works on P100 cards. It seems the V100 architecture sometimes conflicts with the PyTorch version used in the provided Dockerfile. Not a solution, but a workaround.
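Another workaround sometimes suggested for CUDNN_STATUS_MAPPING_ERROR (not from this answer, and it will slow training down) is to bypass cuDNN and fall back to PyTorch's native convolution kernels, which at least isolates whether cuDNN is the culprit:

```python
# Hypothetical workaround: disable cuDNN near the top of train.py so
# convolutions run through PyTorch's native CUDA kernels instead.
import torch

torch.backends.cudnn.enabled = False   # bypass cuDNN entirely
# Alternatively, keep cuDNN but turn off its algorithm autotuner:
# torch.backends.cudnn.benchmark = False
```

If training then runs without crashing, the problem lies in the cuDNN/driver combination rather than in the vid2vid code or its arguments.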