My training works when I train on mini-batches on a single GPU (the default).
if USE_CUDA:
    encoderchar = encoderchar.cuda()
    encoder = encoder.cuda()
    decoder = decoder.cuda()
However, when I train using all available GPUs, I get an error.
if USE_CUDA:
    encoderchar = torch.nn.DataParallel(encoderchar, device_ids=[0, 1, 2, 3, 4, 5, 6, 7])
    encoder = torch.nn.DataParallel(encoder, device_ids=[0, 1, 2, 3, 4, 5, 6, 7])
    decoder = torch.nn.DataParallel(decoder, device_ids=[0, 1, 2, 3, 4, 5, 6, 7])
    encoderchar = encoderchar.cuda()
    encoder = encoder.cuda()
    decoder = decoder.cuda()
RuntimeError Traceback (most recent call last)
<ipython-input-10-227f3e86847c> in <module>()
18 loss, ar1, ar2 = train(data_input_batch_index, data_input_batch_length, data_target_batch_index, data_target_batch_length,
19 encoderchar, encoder, decoder, encoderchar_optimizer, encoder_optimizer, decoder_optimizer,
---> 20 criterion, batch_size)
21
22 # Keep track of loss
<ipython-input-8-21861d792653> in train(input_batch, input_batch_length, target_batch, target_batch_length, encoderchar, encoder, decoder, encoderchar_optimizer, encoder_optimizer, decoder_optimizer, criterion, batch_size)
21 #reshaped_input_length = Variable(torch.LongTensor(reshaped_input_length)).cuda()
22 hidden_all, output = encoderchar(w, reshaped_input_length)
---> 23 encoder_input[ix] = output.transpose(0,1).contiguous().view(batch_size, -1)
24
25 temporary_target_batch_length = [15] * batch_size
/home/ubuntu/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/autograd/variable.py in __setitem__(self, key, value)
78 else:
79 if isinstance(value, Variable):
---> 80 return SetItem(key)(self, value)
81 else:
82 return SetItem(key, value)(self)
/home/ubuntu/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py in forward(self, i, value)
37 else: # value is Tensor
38 self.value_size = value.size()
---> 39 i._set_index(self.index, value)
40 return i
41
RuntimeError: sizes do not match at /py/conda-bld/pytorch_1493681908901/work/torch/lib/THC/THCTensorCopy.cu:31
The arguments passed to the forward pass of encoderchar are a CUDA LongTensor and a list.
hidden_all, output = encoderchar(w, reshaped_input_length)
encoder_input[ix] = output.transpose(0,1).contiguous().view(batch_size, -1)
nvidia-smi shows the following after the error is thrown.
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|    0     18320    C   python                                         453MiB |
|    1     18320    C   python                                         266MiB |
|    2     18320    C   python                                         266MiB |
|    3     18320    C   python                                         266MiB |
|    4     18320    C   python                                         266MiB |
|    5     18320    C   python                                         266MiB |
|    6     18320    C   python                                         266MiB |
|    7     18320    C   python                                         262MiB |
+-----------------------------------------------------------------------------+
What is going wrong here?
Answer 0 (score: 0)
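DataParallel needs to know which dim to split the input data along (i.e. which dim is batch_size). By default it assumes dim=0. For the input to the encoderchar module, the batch size is at dim 1, so each replica receives a slice along the wrong axis and the resulting output sizes no longer match.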
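To see why the split dimension matters, here is a minimal pure-Python sketch. It is not DataParallel's implementation: `chunk` and `shape` are hypothetical helpers that only mimic cutting a 2-D (seq_len, batch_size) array into equal pieces along one dimension, and the sizes are made up for illustration.

```python
# Pure-Python stand-in for chunking a 2-D input across devices.
# Assumption: the input is laid out as (seq_len, batch_size) and both
# sizes divide evenly by the number of devices.

def shape(x):
    """Return (rows, cols) of a rectangular nested list."""
    return (len(x), len(x[0]) if x else 0)

def chunk(x, n_chunks, dim):
    """Split nested list x into n_chunks equal pieces along dim (0 or 1)."""
    rows, cols = shape(x)
    if dim == 0:
        step = rows // n_chunks
        return [x[i * step:(i + 1) * step] for i in range(n_chunks)]
    step = cols // n_chunks
    return [[row[i * step:(i + 1) * step] for row in x] for i in range(n_chunks)]

seq_len, batch_size, n_devices = 16, 8, 4  # hypothetical sizes
w = [[t * batch_size + b for b in range(batch_size)] for t in range(seq_len)]

# dim=0: every piece keeps the full batch but only part of the sequence.
along_dim0 = [shape(c) for c in chunk(w, n_devices, dim=0)]
# dim=1: every piece keeps the full sequence for part of the batch.
along_dim1 = [shape(c) for c in chunk(w, n_devices, dim=1)]

print(along_dim0)
print(along_dim1)
```

Splitting along dim 0 here cuts the sequence rather than the batch, which is the situation described above for the encoderchar input.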
So either modify the DataParallel instance to specify dim=1:
encoderchar = torch.nn.DataParallel(encoderchar, device_ids=[0, 1, 2, 3, 4, 5, 6, 7], dim=1)
Or change the shape of the input so that the batch_size dim is moved to position 0:
w = w.view(batch_size, -1)
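One caveat that may be worth checking: `view` reinterprets the underlying memory in row-major order rather than swapping axes. If `w` is laid out as (seq_len, batch_size), then `w.view(batch_size, -1)` will generally mix samples within a row, whereas `w.transpose(0, 1).contiguous()` keeps each sample's sequence together (the module would then also need to accept batch-first input). The sketch below mimics both operations on nested lists; `view_2d` and `transpose_2d` are hypothetical helpers, not PyTorch calls.

```python
def view_2d(x, rows):
    """Mimic tensor.view(rows, -1) on a contiguous row-major 2-D nested list."""
    flat = [v for row in x for v in row]
    cols = len(flat) // rows
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

def transpose_2d(x):
    """Mimic swapping the two axes of a 2-D nested list."""
    return [list(col) for col in zip(*x)]

seq_len, batch_size = 3, 2  # hypothetical sizes
# w[t][b] labels the element for sample b at time step t.
w = [[f"s{b}t{t}" for b in range(batch_size)] for t in range(seq_len)]

viewed = view_2d(w, batch_size)   # same memory order, new shape
transposed = transpose_2d(w)      # axes actually swapped

print(viewed[0])      # first row after the view-style reshape
print(transposed[0])  # first row after the transpose
```

In the transposed result each row holds a single sample's time steps, while the view-style reshape interleaves samples, so the two are not interchangeable.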