ValueError (bad shape) when using a pretrained embedding matrix in a Keras model

Asked: 2018-08-06 15:09:15

Tags: python keras deep-learning

        user_id tags
0   234 drama , police , year , perfect , space , mech...
1   382 short normal , city , movie short , thriller ,...
2   741 world , tv short seasonal , school , life , pe...

As in the dataframe above, I previously computed the 15 most relevant words for each user, and built a pretrained embedding matrix from the GloVe dataset.

import numpy as np
from tqdm import tqdm

GLOVE = 'Mypath/Anime_project/glove.6B.300d.txt'
embeddings_index = {}
with open(GLOVE, encoding='utf8') as f:
    for line in tqdm(f):
        values = line.rstrip().rsplit(' ')
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
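The parsing loop above can be checked on a small in-memory sample. This sketch uses a hypothetical two-line, 3-dimensional snippet standing in for the real glove.6B.300d.txt file:

```python
import numpy as np

# Minimal sketch of the GloVe parsing loop, run on a made-up
# two-line sample instead of the real glove.6B.300d.txt.
glove_lines = [
    "drama 0.1 0.2 0.3",
    "police 0.4 0.5 0.6",
]
embeddings_index = {}
for line in glove_lines:
    values = line.rstrip().rsplit(" ")
    word = values[0]                                 # first token is the word
    coefs = np.asarray(values[1:], dtype="float32")  # remaining tokens are the vector
    embeddings_index[word] = coefs

print(len(embeddings_index))            # 2
print(embeddings_index["drama"].shape)  # (3,) here; (300,) for the real file
```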

Then I used the Keras Tokenizer:

from keras.preprocessing.text import Tokenizer

tags_doc['doc_len'] = tags_doc["tags"].apply(lambda words: len(words.split(",")))
max_seq_len = np.round(tags_doc['doc_len'].mean() + tags_doc['doc_len'].std()).astype(int)
docs = tags_doc["tags"].tolist()
processed_docs = " ".join(docs).split(" , ")
print("tokenizing input data...")
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, lower=True, char_level=False)
tokenizer.fit_on_texts(processed_docs)  # leaky
word_sequence = tokenizer.texts_to_sequences(processed_docs)
word_index = tokenizer.word_index
print("dictionary size: ", len(word_index))

from keras.preprocessing import sequence

# pad sequences
word_padded = sequence.pad_sequences(word_sequence, maxlen=max_seq_len)
# split the data into a training set and a validation set
indices = np.arange(word_padded.shape[0])
np.random.shuffle(indices)
data = word_padded[indices]
VALIDATION_SPLIT=0.2
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train = data[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]

x_train has shape (904995, 15) and x_val has shape (226248, 15).

embed_dim = 300
embedding_matrix = np.zeros((len(word_index) + 1, embed_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    # words not found in the embedding index stay all-zeros
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
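To see how that loop lines up rows of the matrix with the tokenizer's word_index, here is a self-contained sketch with a toy three-word index (the words and 3-dimensional vectors are made up for illustration):

```python
import numpy as np

# Toy stand-ins for the tokenizer's word_index and the GloVe index;
# "mecha" is deliberately missing from embeddings_index, so its row
# stays all-zeros, exactly as in the loop above.
embed_dim = 3
word_index = {"drama": 1, "police": 2, "mecha": 3}
embeddings_index = {
    "drama": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "police": np.array([0.4, 0.5, 0.6], dtype="float32"),
}

embedding_matrix = np.zeros((len(word_index) + 1, embed_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

print(embedding_matrix.shape)                          # (4, 3): row 0 reserved for padding
print(np.count_nonzero(embedding_matrix.sum(axis=1)))  # 2 of 3 words covered
```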

Then I add that matrix in the Keras functional API:

from keras.layers import Embedding, Input, Dropout

embedding_layer = Embedding(len(word_index) + 1,
                            embed_dim,
                            weights=[embedding_matrix],
                            input_length=max_seq_len,
                            trainable=False)
sequence_input = Input(shape=(max_seq_len,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
embedded_sequences = Dropout(0.2)(embedded_sequences)

Then, when I fit the model, I get this error:

ValueError: All input arrays (x) should have the same number of samples. Got array shapes: [(64642, 1), (64642, 1), (904995, 15)]

I understand that my problem lies in the shape of my sequence inputs (x_train, x_val), but I don't know how to fix it.

1 answer:

Answer 0 (score: 0)

It seems that x_train and y_train do not have the same length. Check their lengths:

    len(x_train)
    len(y_train)
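A hedged sketch of the likely root cause: the question shuffles and splits only the padded sequences, so any labels array falls out of sync with x. Applying the same permutation and the same split point to the labels (the names below are illustrative, not from the question, and the data is a small toy stand-in for the real (904995, 15) input) keeps the sample counts equal:

```python
import numpy as np

# Toy data standing in for word_padded and its per-row labels.
rng = np.random.RandomState(0)
word_padded = rng.randint(0, 100, size=(10, 15))
labels = np.arange(10)

indices = np.arange(word_padded.shape[0])
rng.shuffle(indices)
data = word_padded[indices]
labels = labels[indices]   # apply the SAME permutation to y

VALIDATION_SPLIT = 0.2
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train, y_train = data[:-nb_validation_samples], labels[:-nb_validation_samples]
x_val, y_val = data[-nb_validation_samples:], labels[-nb_validation_samples:]

print(len(x_train), len(y_train))  # 8 8
print(len(x_val), len(y_val))      # 2 2
```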