Question

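I am working with text data from human-machine dialogs, where each dialog consists of multiple turns (each turn is in turn a sequence of words), and I want to predict a label for each turn.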

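My idea is to run a recurrent neural network over fixed-size turn representations, which I compute with a simple multi-layer perceptron from a few features (so I am not iterating over the words within a turn here).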

The code example below, with the TurnMLP and DialogRNN classes, shows how I do this.

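My question: if I compute the turn representations step by step in a for loop and then combine them with torch.stack(), will the loss be backpropagated through all of the MLP timesteps (i.e. all forward computations)? Or is only the last step of the for loop taken into account in the loss computation? Is there another way to do this?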

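PS: Because I want to compute different feature sets for different items in the training data, I need the turn MLP to convert each turn to a fixed size; otherwise it would of course not be necessary.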
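
A minimal, self-contained sketch of how the gradient question could be checked empirically, separate from the model code further down. The names `mlp` and `turns` and the random inputs are hypothetical stand-ins, not part of the original model. It relies on autograd recording every module call made inside the loop and on torch.stack() being a differentiable operation, so each stacked element should keep its own path back to its input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the per-turn MLP (4 features -> 20 hidden units).
mlp = nn.Linear(4, 20)

# Three turns as float feature vectors. requires_grad=True is set only so
# that we can inspect whether each individual turn receives a gradient.
turns = [torch.rand(4, requires_grad=True) for _ in range(3)]

# Call the module once per turn, then combine the results with torch.stack.
hidden = torch.stack([torch.sigmoid(mlp(t)) for t in turns])  # shape (3, 20)

# Any scalar computed from the stacked tensor serves as a toy loss.
loss = hidden.sum()
loss.backward()

# True for every turn whose forward call took part in backpropagation.
turn_has_grad = [
    t.grad is not None and bool(t.grad.abs().sum() > 0) for t in turns
]
print(turn_has_grad)

# The shared weights accumulate one gradient summed over all three calls.
print(mlp.weight.grad.shape)
```

If every entry of `turn_has_grad` is true, then all loop iterations contributed to the backward pass, not only the last one.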


import torch
import torch.nn as nn

class TurnMLP(nn.Module):
  def __init__(self):
    # nn.Module must be initialized before submodules are assigned
    super(TurnMLP, self).__init__()
    self.linear = nn.Linear(4, 20)

  def forward(self, x_turn):
    return torch.sigmoid(self.linear(x_turn))

class DialogRNN(nn.Module):
  def __init__(self):
    # nn.Module must be initialized before submodules are assigned
    super(DialogRNN, self).__init__()
    self.turn_mlp = TurnMLP()
    self.lstm = nn.LSTM(20, 15)
    self.dense = nn.Linear(15, 3)

  def forward(self, x_dialog):  # x_dialog is a list of turns, a turn is a feature vector
    turns_hidden = []

    # Here comes the interesting part: is it okay to call the turn_mlp at each timestep and expect it to store all gradients? I guess not, right?
    for x_turn in x_dialog:
      turns_hidden.append(self.turn_mlp(x_turn))
    turns_hidden = torch.stack(turns_hidden)
    turns_outputs, _ = self.lstm(turns_hidden)
    # return raw logits: cross_entropy applies log-softmax internally
    return self.dense(turns_outputs)

# TRAINING
dialog_rnn = DialogRNN()

# a dialog with 3 turns, converted to float feature vectors (nn.Linear rejects integer tensors)
sample_dialog = [torch.tensor([0., 1., 0., 1.]), torch.tensor([1., 0., 1., 1.]), torch.tensor([1., 0., 0., 0.])]
sample_labels = torch.tensor([2,0,1])

prediction = dialog_rnn(sample_dialog)
loss = torch.nn.functional.cross_entropy(prediction, sample_labels)
optimizer = torch.optim.Adam(dialog_rnn.parameters())

optimizer.zero_grad()  # clear any gradients left over from a previous step
loss.backward()
optimizer.step()
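
On the "another way" part of the question, one possible alternative is sketched below, under the assumption that all turns in a dialog share the same feature size. Since nn.Linear operates on the last dimension of its input, the raw turns can be stacked first and passed through the MLP in a single call. Here `turn_mlp` is a hypothetical stand-in built with nn.Sequential, not the TurnMLP class above. If the feature sets differ in size between items, as mentioned in the PS, this shortcut does not apply as written.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the per-turn MLP (4 features -> 20 hidden units).
turn_mlp = nn.Sequential(nn.Linear(4, 20), nn.Sigmoid())

# A dialog with 3 turns, each a float feature vector of the same size.
dialog = [
    torch.tensor([0., 1., 0., 1.]),
    torch.tensor([1., 0., 1., 1.]),
    torch.tensor([1., 0., 0., 0.]),
]

# Variant 1: call the MLP once per turn, then stack the results.
looped = torch.stack([turn_mlp(turn) for turn in dialog])   # shape (3, 20)

# Variant 2: stack the raw turns first, then call the MLP once.
batched = turn_mlp(torch.stack(dialog))                     # shape (3, 20)

# Both variants should agree up to floating-point tolerance.
same = torch.allclose(looped, batched, atol=1e-6)
print(same)
```

The single-call variant avoids the Python loop, while the loop remains the more flexible choice when each turn needs its own preprocessing.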

PyTorch: calling other modules inside the for loop over RNN timesteps

0 answers: