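I'm not sure why I keep getting this error. I checked the length of my tokenized and encoded text data, and it matches the input length I chose. Here is the code: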
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np
training_samples = 6603
max_words = 10000 # We will only consider the top 10,000 words in the dataset
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)
sequences = tokenizer.texts_to_sequences(X_train)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
print('Shape of data tensor:', X_train.shape)
max_length = 25
padded_s = pad_sequences(sequences, maxlen=max_length, padding='post')
print(padded_s)
print(padded_s.shape)
y_train = np.array(y_train)
y_test = np.array(y_test)
This produces the following output:
Found 10759 unique tokens.
Shape of data tensor: (5942,)
[[ 17 119 154 ... 0 0 0]
[ 31 116 40 ... 0 0 0]
[1925 1711 15 ... 184 0 0]
...
[ 6 1915 375 ... 0 0 0]
[ 693 190 24 ... 0 0 0]
[ 1 570 2 ... 0 0 0]]
**(5942, 25)**
As you can see above, the sequence length is 25, not 1!
import os
glove_dir = '/Users/xxx/Downloads/glove.6B'
embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'))
for line in f:
    # Each line holds a word followed by its embedding coefficients.
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Found %s word vectors.' % len(embeddings_index))
embedding_dim = 100
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
model = Sequential()
model.add(Embedding(max_words, embedding_dim, weights=[embedding_matrix], input_length=max_length, trainable=False))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(padded_s, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(X_val, y_val))
model.save_weights('pre_trained_glove_model.h5')
This raises the following error:
ValueError: Error when checking input: expected embedding_1_input to have shape (25,) but got array with shape (1,)
Any help would be greatly appreciated. Thank you very much!
Answer 0 (score: 0)
Fixed: I had forgotten to convert the validation set to sequences and pad those sequences.
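To illustrate the fix, here is a minimal, dependency-free sketch. It does not use Keras: `fit_vocab` and `encode_and_pad` are hypothetical stand-ins for `Tokenizer.fit_on_texts` / `texts_to_sequences` and `pad_sequences`, and the sample texts are invented. The point is only that the validation texts go through the same fitted vocabulary and the same padding length as the training texts, so both end up with width `max_length`.

```python
# Hypothetical stand-ins for the Keras preprocessing steps, for illustration only.

def fit_vocab(texts):
    """Assign each word in the training texts an index starting at 1 (0 is padding)."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def encode_and_pad(texts, vocab, max_length):
    """Encode texts with an already fitted vocabulary and post-pad to max_length."""
    rows = []
    for text in texts:
        # Words missing from the vocabulary are dropped.
        ids = [vocab[w] for w in text.lower().split() if w in vocab]
        # Keep the last max_length ids, mirroring pad_sequences' default truncating='pre'.
        ids = ids[-max_length:]
        rows.append(ids + [0] * (max_length - len(ids)))
    return rows


max_length = 5
train_texts = ["the movie was great",
               "terrible plot and bad acting overall honestly"]
val_texts = ["great acting", "the plot was unknownword"]

vocab = fit_vocab(train_texts)  # fit on the training texts only
padded_train = encode_and_pad(train_texts, vocab, max_length)
padded_val = encode_and_pad(val_texts, vocab, max_length)  # same vocab, same length

# Every row, training or validation, now has exactly max_length entries.
print(len(padded_train[0]), len(padded_val[0]))
```

With the objects from the question, the equivalent would presumably be `tokenizer.texts_to_sequences(X_val)` followed by `pad_sequences(..., maxlen=max_length, padding='post')`, with the padded result passed in `validation_data` in place of the raw `X_val`. The tokenizer is reused as fitted on `X_train` and is not refitted on the validation texts.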