Input arrays should have the same number of samples as target arrays

Time: 2018-04-15 20:03:10

Tags: python neural-network keras

I get an error when training my network: Input arrays should have the same number of samples as target arrays. Found 50000 input samples and 25000 target samples. The shapes are x_train (50000, 200), y_train (25000,), x_test (50000, 200), y_test (25000,). How can I fix this error? I wrote the network in Keras. The dataset is loaded from a CSV file:

# convert lists of strings to lists of integers
x_train = [[int(elt) for elt in sublist] for sublist in x_train]

x_test = [[int(elt) for elt in sublist] for sublist in x_test]
y_train = [int(elt) for elt in y_train]
y_test = [int(elt) for elt in y_test]


# convert integer reviews into word reviews
x_full = x_train + x_test
x_full_words = [[index_to_word[idx] for idx in rev if idx!=0] for rev in x_full]
all_words = [word for rev in x_full_words for word in rev]
if use_pretrained:

    # initialize word vectors
    word_vectors = Word2Vec(size=word_vector_dim, min_count=1)

    # create entries for the words in our vocabulary
    word_vectors.build_vocab(x_full_words)

    # sanity check
    assert (len(list(set(all_words))) == len(word_vectors.wv.vocab)), "3rd sanity check failed!"

    # fill entries with the pre-trained word vectors
    word_vectors.intersect_word2vec_format(path_to_pretrained_wv + 'GoogleNews-vectors-negative300.bin.gz', binary=True)

    print('pre-trained word vectors loaded')

    norms = [np.linalg.norm(word_vectors[word]) for word in
             list(word_vectors.wv.vocab)]  # in Python 2.7: word_vectors.wv.vocab.keys()
    idxs_zero_norms = [idx for idx, norm in enumerate(norms) if norm < 0.05]
    no_entry_words = [list(word_vectors.wv.vocab)[idx] for idx in idxs_zero_norms]
    print('# of vocab words w/o a Google News entry:', len(no_entry_words))

    # create numpy array of embeddings
    embeddings = np.zeros((max_features + 1, word_vector_dim))
    for word in list(word_vectors.wv.vocab):
        idx = word_to_index[word]
        # word_to_index is 1-based! the 0-th row, used for padding, stays at zero
        embeddings[idx,] = word_vectors[word]

    print('embeddings created')

else:
    print('not using pre-trained embeddings')

And the model:

#model
my_input = Input(shape=(max_size,)) # we leave the 2nd argument of shape blank because the Embedding layer cannot accept an input_shape argument

if use_pretrained:
    embedding = Embedding(input_dim=embeddings.shape[0], # vocab size, including the 0-th word used for padding
                          output_dim=word_vector_dim,
                          weights=[embeddings], # we pass our pre-trained embeddings
                          input_length=max_size,
                          trainable=not do_static,
                          ) (my_input)
else:
    embedding = Embedding(input_dim=max_features + 1,
                          output_dim=word_vector_dim,
                          trainable=not do_static,
                          ) (my_input)

embedding_dropped = Dropout(drop_rate)(embedding)

# feature map size should be equal to max_size-filter_size+1
# tensor shape after conv layer should be (feature map size, nb_filters)
print('branch A:',nb_filters,'feature maps of size',max_size-filter_size_a+1)
print('branch B:',nb_filters,'feature maps of size',max_size-filter_size_b+1)

# A branch
conv_a = Conv1D(filters = nb_filters,
              kernel_size = filter_size_a,
              activation = 'relu',
              )(embedding_dropped)

pooled_conv_a = GlobalMaxPooling1D()(conv_a)

pooled_conv_dropped_a = Dropout(drop_rate)(pooled_conv_a)

# B branch
conv_b = Conv1D(filters = nb_filters,
              kernel_size = filter_size_b,
              activation = 'relu',
              )(embedding_dropped)

pooled_conv_b = GlobalMaxPooling1D()(conv_b)

pooled_conv_dropped_b = Dropout(drop_rate)(pooled_conv_b)

concat = Concatenate()([pooled_conv_dropped_a,pooled_conv_dropped_b])

concat_dropped = Dropout(drop_rate)(concat)

# we finally project onto a single unit output layer with sigmoid activation
prob = Dense(units = 1, # dimensionality of the output space
             activation = 'sigmoid',
             )(concat_dropped)

model = Model(my_input, prob)
model.layers[4].output_shape # dimensionality of document encodings (nb_filters*2)

1 answer:

Answer 0 (score: 0)

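From your description of the data I cannot tell exactly where the problem comes from, but every input sample (x) must have a target (y), which is why you get the error "Input arrays should have the same number of samples as target arrays". Your shapes (and the Keras error) show that you have 50,000 instances but only 25,000 targets.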
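A minimal sketch of that one-target-per-sample requirement, checked by hand before calling fit. The helper name check_same_num_samples is hypothetical (it is not part of Keras), and the arrays are dummy data that only reproduce the shapes reported in the question:

```python
import numpy as np

def check_same_num_samples(x, y, name="train"):
    """Raise a ValueError if x and y do not have the same number of samples."""
    n_x, n_y = len(x), len(y)
    if n_x != n_y:
        raise ValueError(
            f"{name}: found {n_x} input samples and {n_y} target samples"
        )
    return n_x

# Dummy arrays with the shapes reported in the question (not the real data).
x_train = np.zeros((50000, 200), dtype=np.int32)
y_train = np.zeros((25000,), dtype=np.int32)

try:
    check_same_num_samples(x_train, y_train)
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)
    print(mismatch_message)
```

Running a check like this right after loading, and again after each preprocessing step, should narrow down the step at which the two counts start to differ.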

Either there is an error in the CSV data, or something went wrong during import or processing.
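One way to tell those two cases apart is to count rows directly after reading, before any other processing. The sketch below assumes the features and labels live in separate CSV files with one sample per line. The file contents are hypothetical in-memory stand-ins, and load_rows is an illustrative helper, not the loading code from the question:

```python
import csv
import io

def load_rows(handle):
    """Read a CSV handle into a list of integer lists, skipping blank lines."""
    return [[int(elt) for elt in row] for row in csv.reader(handle) if row]

# Hypothetical in-memory stand-ins for the real CSV files.
x_csv = io.StringIO("1,2,3\n4,5,6\n7,8,9\n0,1,2\n")
y_csv = io.StringIO("0\n1\n")

x_rows = load_rows(x_csv)
y_rows = [row[0] for row in load_rows(y_csv)]

print("x rows:", len(x_rows), "y rows:", len(y_rows))
counts_match = len(x_rows) == len(y_rows)
if not counts_match:
    print("row counts differ: inspect the files or the loading code")
```

If the counts already differ at this point, the files themselves are the likely culprit. If they match here, the mismatch is probably introduced by a later step.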