Question

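I have trained and tested a CNN for sentiment analysis. The training and test data were prepared in the same way: the sentences were tokenized and each word was assigned a unique integer: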

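from keras.preprocessing.text import Tokenizer  # or tensorflow.keras.preprocessing.text, depending on the installation

tokenizer = Tokenizer(filters='$%&()*/:;<=>@[\\]^`{|}~\t\n')
tokenizer.fit_on_texts(text)
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved
sequences = tokenizer.texts_to_sequences(text)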

A pre-trained GloVe model is then used to create the embedding matrix for the CNN as follows:

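import numpy as np

filepath_glove = 'glove.twitter.27B.200d.txt'
glove_vocab = []
glove_embd = []
embedding_dict = {}

# Each line of the GloVe file holds a word followed by its vector components
with open(filepath_glove, 'r', encoding='UTF-8') as file:
    for line in file:
        row = line.strip().split(' ')
        vocab_word = row[0]
        glove_vocab.append(vocab_word)
        embed_vector = [float(i) for i in row[1:]]  # convert to list of float
        embedding_dict[vocab_word] = embed_vector

# The original post does not show how embedding_matrix is created;
# a zero matrix with one row per tokenizer index is assumed here
embedding_matrix = np.zeros((vocab_size, 200))
for word, index in tokenizer.word_index.items():
    if word in embedding_dict:  # words missing from GloVe keep an all-zero row
        embedding_matrix[index] = embedding_dict[word]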

At this point I also use the test sentences to create this matrix, which is then passed as the weights to the Embedding layer:

e= Embedding(vocab_size, 200, input_length=maxSeqLength, weights=[embedding_matrix], trainable=False)(inp)

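Now I would like to reload the model and test it on some new data, but this means the embedding matrix will not contain some of the words from the new data. This makes me wonder whether the test data should have been included when creating the embedding matrix at all. If not, how does the Embedding layer handle these new words? This part is similar to the following question, for which I could not find an answer: How does the Keras Embedding Layer work if word is not found? Thanks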

Answer 1

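It is quite simple. You provide vocab_size, which is the number of words the Embedding layer knows about. If you pass an index that falls outside the range of vocab_size (a new word), it will either be ignored or Keras will raise an error.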
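
To make those two outcomes concrete, here is a minimal sketch in plain Python rather than Keras itself. The helpers fit_word_index, encode, and lookup are hypothetical stand-ins, not Keras functions: encode mimics how a Keras Tokenizer created without an oov_token skips words it did not see during fitting, and lookup mimics a fixed-size embedding table that cannot serve an index at or above vocab_size. The indexing is simplified (first-seen order rather than frequency order), and how the real Embedding layer reacts to an out-of-range index may depend on the backend, so treat this only as an illustration.

```python
def fit_word_index(texts):
    """Assign each distinct word an integer starting at 1 (0 is reserved for padding)."""
    word_index = {}
    for text in texts:
        for word in text.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index


def encode(text, word_index):
    """Map words to integers, silently skipping words that were never seen."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


def lookup(indices, embedding_matrix):
    """Return one row per index; an index outside the table is an error."""
    vocab_size = len(embedding_matrix)
    rows = []
    for i in indices:
        if not 0 <= i < vocab_size:
            raise IndexError(
                f"index {i} is outside the embedding table (vocab_size={vocab_size})"
            )
        rows.append(embedding_matrix[i])
    return rows


train_texts = ["good movie", "bad movie"]
word_index = fit_word_index(train_texts)
vocab_size = len(word_index) + 1  # same convention as in the question
embedding_matrix = [[float(i)] * 2 for i in range(vocab_size)]  # toy 2-d vectors

# 'terrible' was never seen, so it is dropped before the lookup happens
new_sequence = encode("terrible movie", word_index)
new_vectors = lookup(new_sequence, embedding_matrix)
```

In this sketch the unseen word simply disappears from the input, so the model never receives a vector for it; an index that did slip past vocab_size would fail in lookup instead.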

This answers your question about whether all of the data should be included in the embedding matrix. Yes, it should.
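
As a rough illustration of that advice, the sketch below builds the word index over the training and test text together, so that every word seen at prediction time has its own row. It is plain Python with a toy dictionary standing in for the GloVe vectors; build_embedding_matrix is a hypothetical helper, and the all-zero fallback for words without a pre-trained vector is an assumption, not something the original posts specify.

```python
def build_embedding_matrix(texts, embedding_dict, dim):
    """Index every word in `texts` and place its pre-trained vector in a matrix."""
    word_index = {}
    for text in texts:
        for word in text.lower().split():
            # Index 0 stays free for padding, so numbering starts at 1
            word_index.setdefault(word, len(word_index) + 1)
    # One row per index; words without a pre-trained vector keep an all-zero row
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, index in word_index.items():
        if word in embedding_dict:
            matrix[index] = list(embedding_dict[word])
    return word_index, matrix


# Toy stand-in for the GloVe dictionary built in the question
embedding_dict = {"good": [0.1, 0.2], "movie": [0.3, 0.4], "terrible": [0.5, 0.6]}
train_texts = ["good movie"]
test_texts = ["terrible movie"]

# Fitting on training and test text together gives 'terrible' its own row
word_index, embedding_matrix = build_embedding_matrix(
    train_texts + test_texts, embedding_dict, dim=2
)
```

The limitation the question raises still applies: any word that appears only in later, unseen data has no row in this matrix.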
