   user_id  tags
0  234      drama , police , year , perfect , space , mech...
1  382      short normal , city , movie short , thriller ,...
2  741      world , tv short seasonal , school , life , pe...
As in the dataframe above, I previously computed the 15 most relevant words for each user, and I built a pretrained embedding matrix from the GloVe dataset:
import numpy as np
from tqdm import tqdm

GLOVE = 'Mypath/Anime_project/glove.6B.300d.txt'
embeddings_index = {}
with open(GLOVE, encoding='utf8') as f:
    for line in tqdm(f):
        values = line.rstrip().rsplit(' ')
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
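Each line of a GloVe file is a word followed by space-separated floats. The parsing logic above can be checked on a made-up two-dimensional line (the word and values here are fabricated for illustration):

```python
import numpy as np

# A fabricated line in GloVe's "word v1 v2 ..." format
line = "anime 0.1 -0.2"
values = line.rstrip().rsplit(' ')
word = values[0]                                  # the token itself
coefs = np.asarray(values[1:], dtype='float32')   # the vector components
print(word, coefs.shape)  # anime (2,)
```

With the real glove.6B.300d.txt file, `coefs.shape` would be `(300,)` for every line.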
Then I use the Keras tokenizer:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

tags_doc['doc_len'] = tags_doc["tags"].apply(lambda words: len(words.split(",")))
max_seq_len = np.round(tags_doc['doc_len'].mean() + tags_doc['doc_len'].std()).astype(int)

docs = tags_doc["tags"].tolist()
processed_docs = " ".join(docs).split(" , ")

print("tokenizing input data...")
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, lower=True, char_level=False)
tokenizer.fit_on_texts(processed_docs)  # leaky: fitted on the full corpus, including validation text
word_sequence = tokenizer.texts_to_sequences(processed_docs)
word_index = tokenizer.word_index
print("dictionary size: ", len(word_index))

# pad all sequences to the same length
word_padded = sequence.pad_sequences(word_sequence, maxlen=max_seq_len)
# split the data into a training set and a validation set
indices = np.arange(word_padded.shape[0])
np.random.shuffle(indices)
data = word_padded[indices]
VALIDATION_SPLIT=0.2
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train = data[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
x_train has shape (904995, 15) and x_val has shape (226248, 15).
embed_dim = 300
embedding_matrix = np.zeros((len(word_index) + 1, embed_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in the embedding index will remain all-zeros
        embedding_matrix[i] = embedding_vector
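Since out-of-vocabulary words are left as all-zero rows, it can be worth checking how much of the vocabulary actually found a GloVe vector. A self-contained sketch of that check, using toy stand-ins for `word_index` and `embeddings_index`:

```python
import numpy as np

# Toy vocabulary and embedding index to illustrate the coverage check
word_index = {'drama': 1, 'police': 2, 'zzzunknown': 3}
embeddings_index = {'drama': np.ones(3), 'police': np.ones(3)}

embedding_matrix = np.zeros((len(word_index) + 1, 3))
for word, i in word_index.items():
    vec = embeddings_index.get(word)
    if vec is not None:
        embedding_matrix[i] = vec

# Fraction of vocabulary rows that received a non-zero vector
covered = np.count_nonzero(embedding_matrix.any(axis=1))
print(covered, "/", len(word_index))  # 2 / 3
```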
Then I add that matrix in the Keras Functional API:
from keras.layers import Input, Embedding, Dropout

embedding_layer = Embedding(len(word_index) + 1,
                            embed_dim,
                            weights=[embedding_matrix],
                            input_length=max_seq_len,
                            trainable=False)

sequence_input = Input(shape=(max_seq_len,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
embedded_sequences = Dropout(0.2)(embedded_sequences)
Then, when I fit the model, I get this error:
ValueError: All input arrays (x) should have the same number of samples. Got array shapes: [(64642, 1), (64642, 1), (904995, 15)]
I understand that the problem lies in the shape of my sequence inputs (x_train, x_val), but I don't know how to fix it.
Answer 0 (score: 0)
It seems that x_train and y_train do not have the same number of samples: the error message shows two input arrays with 64642 rows next to one with 904995. Check their lengths:
len(x_train)
len(y_train)
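More concretely, Keras requires every array passed to `model.fit` to share the same first dimension, and the shuffle/split in the question is applied only to `word_padded`, not to the other inputs. A minimal sketch of aligning everything with one shared index permutation; the names `user_feat_1` and `user_feat_2` are hypothetical stand-ins for the two (64642, 1) arrays in the error, and the data here is random toy data:

```python
import numpy as np

# Toy stand-ins for the real arrays (names and sizes are illustrative)
n_users = 100
word_padded = np.random.randint(0, 1000, size=(n_users, 15))
user_feat_1 = np.random.rand(n_users, 1)
user_feat_2 = np.random.rand(n_users, 1)

# Shuffle every input with the SAME permutation so rows stay aligned per user
indices = np.arange(n_users)
np.random.shuffle(indices)
word_padded = word_padded[indices]
user_feat_1 = user_feat_1[indices]
user_feat_2 = user_feat_2[indices]

# Apply the same split point to every array
VALIDATION_SPLIT = 0.2
nb_val = int(VALIDATION_SPLIT * n_users)
x_train = [user_feat_1[:-nb_val], user_feat_2[:-nb_val], word_padded[:-nb_val]]
x_val = [user_feat_1[-nb_val:], user_feat_2[-nb_val:], word_padded[-nb_val:]]

# Every training input now has the same number of samples
print([a.shape[0] for a in x_train])  # [80, 80, 80]
```

This assumes all three arrays are indexed by the same users in the same order to begin with; if the (64642, 1) arrays cover only a subset of users, they first need to be joined to the tag dataframe on user_id before splitting.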