Error when training a 3D CNN with smaller batch sizes (PyTorch)

Time: 2020-04-29 12:13:45

Tags: python pytorch conv-neural-network

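I am training a deep learning model (a 3D CNN) in PyTorch. It seems to run fine, apart from a problem related to the batch size.

With a batch size of 128, training completes without errors (10 epochs).

However, if I reduce the batch size to 64 or less, training terminates after 5 epochs with the traceback below. I don't understand this: the procedure is nearly identical in every epoch, so the behavior should not differ from one epoch to the next.

Any ideas about what could be causing this?

I am running on 4 GPUs on an HPC cluster. Sorry, I cannot share the code.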

Traceback (most recent call last):
  File "junin-3DCNN.mem.parallel.py", line 245, in <module>
    path = modelpath)
  File "/project/junin/deforestation_forecasting/python_code/Training.py", line 118, in train_model
    output = model.forward(data, sigmoid = not require_sigmoid)    
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/junin/deforestation_forecasting/python_code/ConvRNN.py", line 522, in forward
    x= self.ln(x)
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 81, in forward
    exponential_average_factor, self.eps)
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/functional.py", line 1652, in batch_norm
    raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 100])
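One possible reading of the last line of the traceback: a batch-norm layer in training mode received a tensor of shape [1, 100], that is, a single sample on replica 0. This could happen if the final, incomplete batch of an epoch is very small and is then split across the GPUs. The sketch below is plain Python arithmetic, not the actual training code. The dataset size of 1345 is a made-up number chosen for illustration, the helper names are hypothetical, and the split is assumed to follow the same rule as `torch.chunk` (chunks of size ceil(n / num_gpus)).

```python
def last_batch_size(num_samples, batch_size):
    """Size of the final batch in an epoch when incomplete batches are kept."""
    remainder = num_samples % batch_size
    return remainder if remainder else batch_size


def chunk_sizes(batch, num_chunks):
    """Per-replica sizes, assuming a torch.chunk-style split of the batch."""
    if batch == 0:
        return []
    size = -(-batch // num_chunks)  # ceiling division
    full, rest = divmod(batch, size)
    return [size] * full + ([rest] if rest else [])


# Hypothetical dataset size; the real one is not given in the question.
NUM_SAMPLES = 1345
NUM_GPUS = 4

for batch_size in (128, 64):
    last = last_batch_size(NUM_SAMPLES, batch_size)
    per_gpu = chunk_sizes(last, NUM_GPUS)
    # A replica that receives exactly one sample would trigger the
    # "Expected more than 1 value per channel when training" check.
    risky = 1 in per_gpu
    print(batch_size, last, per_gpu, "single-sample replica" if risky else "ok")
```

With these assumed numbers, a batch size of 128 leaves a final batch of 65 samples, while a batch size of 64 leaves a final batch of 1 sample. If something like this is the cause, passing `drop_last=True` to `torch.utils.data.DataLoader` would skip the incomplete batch. This does not by itself explain why the failure appears only after 5 epochs.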

0 answers:

No answers