When training my network I get the error: Input arrays should have the same number of samples as target arrays. Found 50000 input samples and 25000 target samples.
x_train shape: (50000, 200)
y_train shape: (25000,)
x_test shape: (50000, 200)
y_test shape: (25000,)
How can I fix this error?
The network is written with Keras.
The dataset is loaded from csv files.
import numpy as np
from gensim.models import Word2Vec

# convert the lists of strings into lists of integers
x_train = [[int(elt) for elt in sublist] for sublist in x_train]
x_test = [[int(elt) for elt in sublist] for sublist in x_test]
y_train = [int(elt) for elt in y_train]
y_test = [int(elt) for elt in y_test]
# convert integer reviews into word reviews
x_full = x_train + x_test
x_full_words = [[index_to_word[idx] for idx in rev if idx!=0] for rev in x_full]
all_words = [word for rev in x_full_words for word in rev]
if use_pretrained:
    # initialize word vectors
    word_vectors = Word2Vec(size=word_vector_dim, min_count=1)
    # create entries for the words in our vocabulary
    word_vectors.build_vocab(x_full_words)
    # sanity check
    assert (len(list(set(all_words))) == len(word_vectors.wv.vocab)), "3rd sanity check failed!"
    # fill entries with the pre-trained word vectors
    word_vectors.intersect_word2vec_format(path_to_pretrained_wv + 'GoogleNews-vectors-negative300.bin.gz', binary=True)
    print('pre-trained word vectors loaded')
    norms = [np.linalg.norm(word_vectors[word]) for word in
             list(word_vectors.wv.vocab)]  # in Python 2.7: word_vectors.wv.vocab.keys()
    idxs_zero_norms = [idx for idx, norm in enumerate(norms) if norm < 0.05]
    no_entry_words = [list(word_vectors.wv.vocab)[idx] for idx in idxs_zero_norms]
    print('# of vocab words w/o a Google News entry:', len(no_entry_words))
    # create numpy array of embeddings
    embeddings = np.zeros((max_features + 1, word_vector_dim))
    for word in list(word_vectors.wv.vocab):
        idx = word_to_index[word]
        # word_to_index is 1-based! the 0-th row, used for padding, stays at zero
        embeddings[idx,] = word_vectors[word]
    print('embeddings created')
else:
    print('not using pre-trained embeddings')
And the model:
# model
from keras.models import Model
from keras.layers import Input, Embedding, Dropout, Conv1D, GlobalMaxPooling1D, Concatenate, Dense
my_input = Input(shape=(max_size,)) # we leave the 2nd argument of shape blank because the Embedding layer cannot accept an input_shape argument
if use_pretrained:
    embedding = Embedding(input_dim=embeddings.shape[0],  # vocab size, including the 0-th word used for padding
                          output_dim=word_vector_dim,
                          weights=[embeddings],  # we pass our pre-trained embeddings
                          input_length=max_size,
                          trainable=not do_static,
                          )(my_input)
else:
    embedding = Embedding(input_dim=max_features + 1,
                          output_dim=word_vector_dim,
                          trainable=not do_static,
                          )(my_input)
embedding_dropped = Dropout(drop_rate)(embedding)
# feature map size should be equal to max_size-filter_size+1
# tensor shape after conv layer should be (feature map size, nb_filters)
print('branch A:',nb_filters,'feature maps of size',max_size-filter_size_a+1)
print('branch B:',nb_filters,'feature maps of size',max_size-filter_size_b+1)
# A branch
conv_a = Conv1D(filters = nb_filters,
kernel_size = filter_size_a,
activation = 'relu',
)(embedding_dropped)
pooled_conv_a = GlobalMaxPooling1D()(conv_a)
pooled_conv_dropped_a = Dropout(drop_rate)(pooled_conv_a)
# B branch
conv_b = Conv1D(filters = nb_filters,
kernel_size = filter_size_b,
activation = 'relu',
)(embedding_dropped)
pooled_conv_b = GlobalMaxPooling1D()(conv_b)
pooled_conv_dropped_b = Dropout(drop_rate)(pooled_conv_b)
concat = Concatenate()([pooled_conv_dropped_a,pooled_conv_dropped_b])
concat_dropped = Dropout(drop_rate)(concat)
# we finally project onto a single unit output layer with sigmoid activation
prob = Dense(units = 1, # dimensionality of the output space
activation = 'sigmoid',
)(concat_dropped)
model = Model(my_input, prob)
model.layers[4].output_shape # dimensionality of document encodings (nb_filters*2)
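The model is then trained with a standard compile/fit call along these lines (the loss, optimizer, batch size and epoch count below are placeholders, not the actual values); fit is where Keras compares the number of inputs to the number of targets and raises the error:

model.compile(loss='binary_crossentropy',  # placeholder loss/optimizer
              optimizer='adam',
              metrics=['accuracy'])
# fit checks that x and y have the same number of samples and raises
# "Input arrays should have the same number of samples as target arrays" otherwise
model.fit(np.array(x_train), np.array(y_train),
          batch_size=32,  # placeholder
          epochs=6,       # placeholder
          validation_data=(np.array(x_test), np.array(y_test)))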
Answer 0 (score: 0)
I can't say exactly where the problem comes from based only on your description of the data, but there must be exactly one target (y) for every input sample (x), which is why you get the error "Input arrays should have the same number of samples as target arrays". Your shapes (and the Keras error) show that you have 50,000 input samples but only 25,000 targets.
Either the csv data itself is wrong, or something went wrong during the import or preprocessing step.
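As a first debugging step, print and compare the row counts immediately after loading the csv files and before any conversion, for example along these lines (the file names and the pandas-based loading below are just placeholders for whatever loading code you actually use):

import numpy as np
import pandas as pd

# placeholder file names -- substitute your actual csv paths / loading code
x_train = pd.read_csv('x_train.csv', header=None).values            # expected (50000, 200)
y_train = pd.read_csv('y_train.csv', header=None).values.ravel()    # expected (50000,)

print('x_train:', np.shape(x_train), 'y_train:', np.shape(y_train))

# Keras needs exactly one target per input sample; fail early if the counts diverge
assert len(x_train) == len(y_train), \
    'mismatch: %d inputs vs %d targets' % (len(x_train), len(y_train))

If the mismatch is already there right after reading the files, the problem is in the csv data; if the counts match at that point, something in the later conversion or splitting code is dropping half of the targets.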