I am trying to generate sentence vectors with a GRU in PyTorch. My stories are split into sentences, so I read the stories, split each sentence into words, and converted the words to integers. For example, a batch with 2 samples looks like this:
(0 ,.,.) =
19 21 28 3 0 0
25 16 28 4 17 0
0 0 0 0 0 0
(1 ,.,.) =
19 21 28 3 0 0
25 16 28 4 0 0
15 28 26 27 17 0
Each row is a sentence and each number is a word. What I want is a vector representation for each row. Assuming a dimension of 5:
(0 ,.,.) =
0.1619 -0.0605 -0.3301 -0.0433 0.2786
-0.2069 -0.3152 -0.4366 0.1272 0.3375
0 0 0 0 0
(1 ,.,.) =
-0.1599 -0.0730 -0.3796 0.0214 0.2157
-0.0805 -0.1307 -0.3942 0.0648 0.2704
0.0275 -0.2353 -0.4399 0.0687 0.3218
Previously, I generated sentence vectors by summing the word vectors, and the model reached 99% accuracy; a rough sketch of that baseline is shown right below. With the GRU encoder that follows it, I only get about 30%.
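The summing baseline looked roughly like this (a sketch, not my exact code; treating index 0 as padding and masking it out before the sum is a simplification):

import torch
import torch.nn as nn

def encoder_sum(story, embedding):
    """Sum the word embeddings of each sentence into one vector.
    story: LongTensor of size batch x nbrSent x nbrWord (0 = padding)
    """
    embedded = embedding(story)                # batch x nbrSent x nbrWord x dim
    mask = (story != 0).unsqueeze(-1).float()  # zero out padding positions
    return (embedded * mask).sum(dim=2)        # batch x nbrSent x dim

And this is the GRU encoder that performs much worse: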
import torch
import torch.nn as nn
from torch.autograd import Variable

embedding_dim = 5
embedding = nn.Embedding(30, embedding_dim)  # vocabulary size: large enough for the example indices
gru = nn.GRU(embedding_dim, embedding_dim, num_layers=1, batch_first=True)  # self.gru in my model

# the example batch from above: batch x nbrSent x nbrWord, 0 = padding
story = torch.LongTensor([[[19, 21, 28,  3,  0, 0],
                           [25, 16, 28,  4, 17, 0],
                           [ 0,  0,  0,  0,  0, 0]],
                          [[19, 21, 28,  3,  0, 0],
                           [25, 16, 28,  4,  0, 0],
                           [15, 28, 26, 27, 17, 0]]])
input_embed = embedding(story)                       # batch x nbrSent x nbrWord x dim
slen = torch.LongTensor([2, 3])                      # number of non-padded sentences per story
sentlens = torch.LongTensor([[4, 5, 0], [4, 4, 5]])  # number of words in each sentence

def encoder_gru(input_embed, slen, sentlens, n_layers=1, hidden=None):
    """Encode every sentence of every story into one vector.
    Args:
        input_embed : embedding of input text of size batch x nbrSent x nbrWord x dim
        slen : tensor, number of sentences in each story
        sentlens : tensor, number of words in each sentence
        hidden : tensor, init hidden layer of gru
    """
    batch_size = input_embed.size(0)
    # contains sentence vectors; rows of padded sentences stay zero
    hidden_batch = torch.zeros(batch_size, int(max(slen)), embedding_dim)
    for b in range(batch_size):
        iembed = input_embed[b, 0:slen[b]]  # takes non-padded sentences
        bsent = sentlens[b, 0:slen[b]]      # takes number of words in non-padded sentences
        # pack_padded_sequence needs the sentences sorted by decreasing length
        sorted_slens, idx = bsent.sort(0, descending=True)
        sorted_iembed = iembed[idx]
        pack = torch.nn.utils.rnn.pack_padded_sequence(
            sorted_iembed, sorted_slens.tolist(), batch_first=True)
        h0 = Variable(torch.randn(n_layers, int(slen[b]), embedding_dim))  # random init hidden state
        out, hidden_out = gru(pack, h0)
        # unpacked, unpacked_len = torch.nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
        _, inv_idx = idx.sort()  # sort the outputs again, undo 1st order
        # final hidden state of the last GRU layer is the sentence vector
        hidden_batch[b, 0:slen[b]] = hidden_out[-1][inv_idx].data.clone()
    return hidden_batch
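Calling the encoder on the example batch defined above yields one 5-dimensional vector per sentence, with the padded third sentence of the first story left as zeros:

sent_vecs = encoder_gru(input_embed, slen, sentlens)
print(sent_vecs.size())  # torch.Size([2, 3, 5]) -- batch x nbrSent x dim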
My questions are: