Question

My training works with mini-batches on a single GPU (the default).

if USE_CUDA:
    encoderchar = encoderchar.cuda()
    encoder = encoder.cuda()
    decoder = decoder.cuda()

However, when I train on all available GPUs, I get an error.

if USE_CUDA:
    encoderchar = torch.nn.DataParallel(encoderchar, device_ids=[0, 1, 2, 3, 4, 5, 6, 7])
    encoder = torch.nn.DataParallel(encoder, device_ids=[0, 1, 2, 3, 4, 5, 6, 7])
    decoder = torch.nn.DataParallel(decoder, device_ids=[0, 1, 2, 3, 4, 5, 6, 7])
    encoderchar = encoderchar.cuda()
    encoder = encoder.cuda()
    decoder = decoder.cuda()

I get the following error during the forward pass.

RuntimeError                              Traceback (most recent call last)
<ipython-input-10-227f3e86847c> in <module>()
18         loss, ar1, ar2 = train(data_input_batch_index, data_input_batch_length, data_target_batch_index, data_target_batch_length, 
19                                encoderchar, encoder, decoder, encoderchar_optimizer, encoder_optimizer, decoder_optimizer,
---> 20                                criterion, batch_size)
21 
22         # Keep track of loss
<ipython-input-8-21861d792653> in train(input_batch, input_batch_length, target_batch, target_batch_length, encoderchar, encoder, decoder, encoderchar_optimizer, encoder_optimizer, decoder_optimizer, criterion, batch_size)
21             #reshaped_input_length =  Variable(torch.LongTensor(reshaped_input_length)).cuda()
22         hidden_all, output = encoderchar(w, reshaped_input_length)
---> 23         encoder_input[ix] = output.transpose(0,1).contiguous().view(batch_size, -1)
24 
25     temporary_target_batch_length = [15] * batch_size
/home/ubuntu/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/autograd/variable.py in __setitem__(self, key, value)
78         else:
79             if isinstance(value, Variable):
---> 80                 return SetItem(key)(self, value)
81             else:
82                 return SetItem(key, value)(self)
/home/ubuntu/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py in forward(self, i, value)
37         else:  # value is Tensor
38             self.value_size = value.size()
---> 39         i._set_index(self.index, value)
40         return i
41 

RuntimeError: sizes do not match at /py/conda-bld/pytorch_1493681908901/work/torch/lib/THC/THCTensorCopy.cu:31

The arguments passed to the encoderchar forward call are a CUDA LongTensor and a list.

hidden_all, output = encoderchar(w, reshaped_input_length)
encoder_input[ix] = output.transpose(0,1).contiguous().view(batch_size, -1)

nvidia-smi shows the following after the error is thrown.

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|    0     18320    C   python                                         453MiB |
|    1     18320    C   python                                         266MiB |
|    2     18320    C   python                                         266MiB |
|    3     18320    C   python                                         266MiB |
|    4     18320    C   python                                         266MiB |
|    5     18320    C   python                                         266MiB |
|    6     18320    C   python                                         266MiB |
|    7     18320    C   python                                         262MiB |
+-----------------------------------------------------------------------------+

What is going wrong here?

Answer 1

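DataParallel needs to know which dimension to split the input data along (that is, which dim is the batch_size). By default it assumes that dim=0 is the batch_size dimension of the input.

For the input to the encoderchar module, the batch size is in dim 1, so the default split cuts the wrong axis and the output sizes no longer match what the training code expects.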
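To make the mismatch concrete, here is a minimal sketch that uses torch.chunk as a stand-in for the scatter step of DataParallel. The sizes (seq_len, batch_size, n_gpus) are hypothetical and chosen only for illustration; the real shapes are not shown in the question.

```python
import torch

# Hypothetical sizes for illustration; not taken from the question.
seq_len, batch_size, n_gpus = 5, 16, 8

# Input laid out as (seq_len, batch_size), i.e. the batch lives in dim 1.
w = torch.zeros(seq_len, batch_size, dtype=torch.long)

# Splitting along dim 0 cuts the sequence axis: every piece still
# carries the full batch, and only part of each sequence.
chunks_dim0 = torch.chunk(w, n_gpus, dim=0)

# Splitting along dim 1 cuts the batch axis: every piece holds whole
# sequences for a subset of the batch.
chunks_dim1 = torch.chunk(w, n_gpus, dim=1)

shapes_dim0 = [tuple(c.shape) for c in chunks_dim0]
shapes_dim1 = [tuple(c.shape) for c in chunks_dim1]
print(shapes_dim0)
print(shapes_dim1)
```

With the split on dim 0, each replica would see truncated sequences, so the gathered output cannot be reshaped with view(batch_size, -1) as the training loop expects.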

So either modify the DataParallel instance to specify dim=1:

encoderchar = torch.nn.DataParallel(encoderchar, device_ids=[0, 1, 2, 3, 4, 5, 6, 7], dim=1)
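As a rough end-to-end sketch of the dim=1 option: TinyEncoder below is a hypothetical stand-in for encoderchar (whose definition is not shown in the question), and all sizes are made up. It assumes the module takes input shaped (seq_len, batch_size) and returns the batch in dim 1 as well.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the real ones are not given in the question.
seq_len, batch_size, vocab_size, emb_dim = 5, 16, 100, 8


class TinyEncoder(nn.Module):
    """Stand-in for encoderchar: expects input shaped (seq_len, batch_size)."""

    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)

    def forward(self, w):
        # Returns (seq_len, local_batch, emb_dim), where local_batch is
        # the share of the batch this replica received.
        return self.embedding(w)


model = TinyEncoder()
w = torch.randint(0, vocab_size, (seq_len, batch_size))

if torch.cuda.is_available():
    model = model.cuda()
    w = w.cuda()

# dim=1 makes DataParallel scatter inputs and gather outputs along the
# batch axis. device_ids is left at its default (all visible GPUs);
# with no GPU present the wrapper just calls the module directly.
parallel_model = nn.DataParallel(model, dim=1)
output = parallel_model(w)
output_shape = tuple(output.shape)
print(output_shape)
```

Because outputs are gathered along the same dim, the result has the full batch in dim 1 again, so downstream code written for the single-GPU case should see the shape it expects.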

Or change the input layout so that the batch_size dimension comes first (move it to dim 0):

w = w.view(batch_size, -1)
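One caveat worth checking before using the view call above: view reinterprets the existing memory layout and does not swap axes. Assuming w is stored as (seq_len, batch_size), which the answer implies but the question does not show, a transpose is what actually moves the batch to dim 0. The tiny tensor below is hypothetical and only illustrates the difference.

```python
import torch

# Hypothetical tiny example: 3 time steps, batch of 2.
seq_len, batch_size = 3, 2

# Laid out as (seq_len, batch_size): column b holds sequence b,
# so sequence 0 is [0, 2, 4] and sequence 1 is [1, 3, 5].
w = torch.arange(seq_len * batch_size).view(seq_len, batch_size)

# view only re-slices the flat memory; rows now mix both sequences.
viewed = w.view(batch_size, -1)

# transpose swaps the axes; each row is one complete sequence.
# contiguous() makes the result safe for a later view().
transposed = w.transpose(0, 1).contiguous()

print(viewed.tolist())
print(transposed.tolist())
```

If the batch is moved to dim 0 this way, the module's forward method also has to treat its input as batch-first, so the dim=1 option may be the smaller change.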

