I'm training a deep learning model (a 3D CNN) in PyTorch. Apart from an issue I'm having with the batch size, it seems to run fine.
With a batch size of 128, training runs without problems (10 epochs).
However, if I reduce the batch size to 64 or smaller, it crashes after 5 epochs with the traceback below. I don't understand this: the procedure is essentially the same in every epoch, so the behavior shouldn't change from one epoch to the next.
Any ideas about what could be triggering this?
I'm running on 4 GPUs on an HPC cluster. Sorry, I can't share the code.
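Since I can't post the actual code, here is a rough, hypothetical sketch of how the model is driven across the GPUs (nn.DataParallel over 4 devices, with a toy stand-in instead of the real 3D CNN), just to give context for the traceback that follows:

import torch
import torch.nn as nn

# Hypothetical stand-in model, NOT the real 3D CNN: just enough to show the multi-GPU setup.
model = nn.Sequential(nn.Linear(32, 100), nn.BatchNorm1d(100), nn.ReLU())
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3]).to("cuda:0")

data = torch.randn(128, 32, device="cuda:0")  # batch of 128: DataParallel gives each replica 32 samples
output = model(data)                          # forward pass replicated across the 4 GPUs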
Traceback (most recent call last):
  File "junin-3DCNN.mem.parallel.py", line 245, in <module>
    path = modelpath)
  File "/project/junin/deforestation_forecasting/python_code/Training.py", line 118, in train_model
    output = model.forward(data, sigmoid = not require_sigmoid)
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/junin/deforestation_forecasting/python_code/ConvRNN.py", line 522, in forward
    x= self.ln(x)
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 81, in forward
    exponential_average_factor, self.eps)
  File "/home/anaconda3/envs/py37-2/lib/python3.7/site-packages/torch/nn/functional.py", line 1652, in batch_norm
    raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 100])
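For what it's worth, the ValueError at the very bottom can be reproduced on its own with a plain nn.BatchNorm1d in training mode once a single-sample batch reaches it (matching the torch.Size([1, 100]) in the traceback). This is only a minimal hypothetical snippet, not my model:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(100)   # 100 channels, as in the input size reported above
bn.train()                 # the check only fires in training mode

print(bn(torch.randn(2, 100)).shape)  # batch of 2 samples: works fine
bn(torch.randn(1, 100))               # batch of 1 sample: raises the same
                                      # "Expected more than 1 value per channel when training" ValueError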