Question

I am using a custom model like this:

class SimpleNN(nn.Module):
    def __init__(self, vectors_size, features_size, hidden_size=15, dropout_rate=0.1):
        super(SimpleNN, self).__init__()

        self.vectors_size = vectors_size
        self.features_size = features_size
        self.hidden_size = hidden_size
        self.dropout_rate = dropout_rate

        self.vectors_hidden = nn.Sequential(
            nn.Dropout(self.dropout_rate),
            nn.Linear(vectors_size, vectors_size//2),
            nn.Tanh(),
            nn.Linear(vectors_size//2, features_size),
            nn.Tanh()
        )
        self.hidden = nn.Sequential(
            nn.Linear(features_size*2, hidden_size),
            nn.ReLU(),
        )

        self.output = nn.Linear(hidden_size, 2)

    def forward(self, pairs, features):
        """
        features: (n_samples, features_size)
        """
        vectors = pairs2vectors(train_pub, pairs).to(device)
        embedding_features = self.vectors_hidden(vectors)
        combined_features = torch.cat([features, embedding_features], dim=1)
        return self.output(self.hidden(combined_features))

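This model works fine when I use a single CUDA device, but after wrapping it in 'DataParallel' as shown below, it always reports a size mismatch between features and embedding_features. I found that the n_samples dimension of features does not have the shape I expect, unlike the other batched data. I don't know why this happens or how to fix it.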

    if torch.cuda.device_count() > 1:
        print("Let's use", torch.cuda.device_count(), "GPUs!")
        model = nn.DataParallel(model)

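By the way, I attached a screenshot of the error message. In fact, the pairs and features arguments of the forward method have sizes (batch_size, pair) and (batch_size, features_size) respectively. In my train function, the code is as follows: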

for batch_pairs, batch_data, batch_labels in tqdm(batch_iter(train_pairs, train_data, train_labels, train_batch_size), desc='Epoch'):
    train_iter += 1

    optimizer.zero_grad()

    # Number of examples in this batch
    batch_size = len(batch_data)
    cum_examples += batch_size
    pred_labels = model(batch_pairs, batch_data)

    # Compute the loss with torch.nn.CrossEntropyLoss
    loss_func = torch.nn.CrossEntropyLoss(weight=weight)
    loss = loss_func(pred_labels, batch_labels)

    # Backpropagation
    loss.backward()

    # Gradient clipping
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad)

    # Update the parameters
    optimizer.step()

I think my batch_pairs and batch_features already have the batch_size dimension, don't they?

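According to the documentation for torch.nn.DataParallel, my input is split into chunks that are processed on multiple devices. In my case, it seems that only the features argument is chunked and pairs is not, perhaps because pairs is not a torch.Tensor instance? I will experiment further, but this mechanism seems unreasonable to me.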

I tried converting the pairs to vectors outside the model and passing the result to the forward method, but it still does not work.
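The suspicion above can be checked on a CPU without any GPUs. The following is a minimal sketch, not the actual DataParallel code: `simulate_scatter` is a hypothetical helper that mimics the documented scatter rule (tensors are chunked along dim 0, references to non-tensor objects are duplicated), and the shapes and the `pairs` contents are made-up values. The real implementation also recurses into lists, tuples and dicts, so a list of plain Python integers still ends up replicated in full on every device.

```python
import torch


def simulate_scatter(obj, num_replicas):
    """Hypothetical stand-in for the scatter step: chunk tensors, replicate the rest."""
    if isinstance(obj, torch.Tensor):
        # Tensors are split into roughly equal chunks along the batch dimension.
        return list(torch.chunk(obj, num_replicas, dim=0))
    # Anything that is not a tensor is handed to every replica unchanged.
    return [obj for _ in range(num_replicas)]


batch_size, features_size, num_replicas = 8, 4, 2

features = torch.randn(batch_size, features_size)  # tensor: gets chunked
pairs = [(i, i + 1) for i in range(batch_size)]    # plain list: replicated

feature_chunks = simulate_scatter(features, num_replicas)
pair_chunks = simulate_scatter(pairs, num_replicas)

# Each replica sees 4 feature rows but all 8 pairs, which is the kind of
# mismatch that makes torch.cat(..., dim=1) fail inside forward.
print(feature_chunks[0].shape[0], len(pair_chunks[0]))

# Passing the pairs as a tensor lets them be split in step with the features.
pairs_tensor = torch.tensor(pairs, dtype=torch.long)  # shape (batch_size, 2)
pair_tensor_chunks = simulate_scatter(pairs_tensor, num_replicas)
print(feature_chunks[0].shape[0], pair_tensor_chunks[0].shape[0])
```

If this is indeed the cause, it may also be worth moving the vectors with `.to(features.device)` inside forward instead of a global `device`, since each replica runs on its own GPU.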

Answer 1

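The problem is probably that the batch dimension is not being passed in the input data. If so,

 nn.DataParallel

may be splitting on the wrong dimension. You also mentioned

features: (n_samples, features_size)

which suggests that the batch size is not passed in the input. Please add a batch dimension to your data. For a batch size of 1, your input shape should be [1, features], so in your case it should be [1, n_samples, features_size].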
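One way to verify the batch dimension before calling the wrapped model is a small shape check. This is only a sketch under assumptions: `check_batch_inputs` is a hypothetical helper, not part of PyTorch, and the tensor shapes below are invented for illustration.

```python
import torch


def check_batch_inputs(**inputs):
    """Raise if an input is not a tensor or if the leading dimensions differ."""
    sizes = {}
    for name, value in inputs.items():
        if not isinstance(value, torch.Tensor):
            raise TypeError(f"{name} is {type(value).__name__}, not a torch.Tensor")
        if value.dim() == 0:
            raise ValueError(f"{name} has no batch dimension")
        sizes[name] = value.shape[0]
    if len(set(sizes.values())) > 1:
        raise ValueError(f"batch dimensions differ: {sizes}")
    # Return the shared batch size (0 if no inputs were given).
    return next(iter(sizes.values()), 0)


# Made-up example batch: 16 pairs of indices and 16 feature rows.
batch_pairs = torch.zeros(16, 2, dtype=torch.long)
batch_data = torch.randn(16, 10)

n = check_batch_inputs(pairs=batch_pairs, features=batch_data)
print(n)
```

Calling it with a plain Python list for `pairs` raises a TypeError, which flags an input that would not be split along the batch dimension.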

Hope this helps.

Size mismatch when using multiple CUDA devices (DataParallel) in PyTorch
