CountVectorizer with a neural network does not predict correctly

Time: 2019-05-22 13:19:11

Tags: python tensorflow machine-learning tflearn countvectorizer

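I am trying to run sentiment analysis on random IMDB reviews, using the review text and its score. I turn all the words into a bag of words and feed them into a neural network. However, the predictions do not look right: for any review I enter, the model always returns 50% positive and 50% negative.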

import numpy as np
import pandas as pd

reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
Y = (labels == 'positive').astype(np.int_)  # 1 for positive, 0 otherwise

print(type(reviews))
print(reviews.head())
print(labels.head())


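from sklearn.model_selection import train_test_split

# Split into train/test, then carve a validation set out of the training part
x_train, x_test, y_train, y_test = train_test_split(reviews, Y)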
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train)
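
As a quick check of how much data each split receives, the sketch below works out the shares left after the two successive calls. It assumes scikit-learn's default of holding out 25% when neither `test_size` nor `train_size` is passed, and it ignores rounding. `split_fractions` is a hypothetical helper written for this illustration, not part of any library.

```python
def split_fractions(holdout=0.25):
    """Fractions of the full dataset left after two successive splits.

    Assumes each train_test_split call holds out the same `holdout` share
    (0.25 is scikit-learn's default when no sizes are passed).
    """
    test = holdout                        # first call: test share of all data
    remaining = 1.0 - holdout             # what the second call gets to split
    val = remaining * holdout             # second call: validation share
    train = remaining * (1.0 - holdout)   # what is finally used for fitting
    return {"train": train, "val": val, "test": test}


fractions = split_fractions()
for name, share in fractions.items():
    print(f"{name}: {share:.4f}")
```

Under that assumption a little over half of the reviews are used for fitting the vectorizer and the network.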

from sklearn.feature_extraction.text import CountVectorizer

# min_df=19 seems to be the first value that fills all 10 000 entries, i.e. the 10 000 most commonly used words

vect = CountVectorizer(min_df=19, max_features=10000)
fitter = vect.fit(x_train[0])

X_train = fitter.transform(x_train[0])
X_test = fitter.transform(x_test[0])
X_val = fitter.transform(x_val[0])

print("Vocabulary size: {}".format(len(vect.vocabulary_)))

feature_names = vect.get_feature_names()
print("Number of features: {}".format(len(feature_names)))

print("Vocabulary content:\n {}".format(fitter.vocabulary_))
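
For readers unfamiliar with the bag-of-words representation, here is a minimal plain-Python sketch of the idea behind `fit` and `transform`. It is a toy imitation, not scikit-learn's implementation: it lowercases, keeps tokens of two or more word characters (the same shape as CountVectorizer's default token pattern), and ignores `min_df` and `max_features`. The names `toy_fit` and `toy_transform` and the sample sentences are made up for illustration.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters


def tokenize(text):
    return TOKEN.findall(text.lower())


def toy_fit(documents):
    """Map every distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def toy_transform(documents, vocabulary):
    """One row per document, one column per vocabulary word, holding counts."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            if tok in vocabulary:        # words unseen during fit are dropped
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows


docs = ["A great, great movie", "This movie sucks"]
vocab = toy_fit(docs)
matrix = toy_transform(docs, vocab)
print(vocab)
print(matrix)
```

Note that every row has one column per vocabulary word, so its width is the vocabulary size and does not depend on how long the review is.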


from tflearn.data_utils import pad_sequences, to_categorical

X_train = pad_sequences(X_train.toarray(), maxlen=100, value=0.)
X_test = pad_sequences(X_test.toarray(), maxlen=100, value=0.)
X_val = pad_sequences(X_val.toarray(), maxlen=100, value=0.)
Y_train = to_categorical(y_train, 2)
Y_test = to_categorical(y_test, 2)
Y_val = to_categorical(y_val, 2)
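
One thing worth checking at this step: each row of the count matrix is already a fixed-length vector with one column per vocabulary word, so cutting it to `maxlen=100` keeps only 100 of the vocabulary columns. The plain-Python sketch below illustrates the effect on a made-up row. It assumes the tail of the row is cut; which end is removed depends on the `truncating` argument of the `pad_sequences` being used. `truncate_row` is a hypothetical stand-in, not the library function.

```python
def truncate_row(row, maxlen, value=0):
    """Cut or pad a single row to exactly `maxlen` entries (tail side)."""
    kept = list(row[:maxlen])
    return kept + [value] * (maxlen - len(kept))


vocab_size = 10000   # matches max_features in the question
maxlen = 100         # matches the maxlen passed to pad_sequences

# Hypothetical count row: the review uses words at these column indices.
word_columns = {42: 1, 3051: 2, 9876: 1}
row = [0] * vocab_size
for col, count in word_columns.items():
    row[col] = count

short = truncate_row(row, maxlen)
print("words counted before:", sum(row))
print("words counted after: ", sum(short))
print("columns discarded:   ", vocab_size - len(short))
```

In this made-up case only the word at column 42 survives; any review whose words all sit outside the kept columns would be reduced to an all-zero row.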


import tensorflow
import tflearn

tensorflow.reset_default_graph()

input_layer = tflearn.input_data(shape=[None, 100])
net = tflearn.embedding(input_layer, input_dim=10000, output_dim=128)
hid = tflearn.fully_connected(input_layer, 10, activation='tanh') # a hidden layer with 10 neurons
output_layer = tflearn.fully_connected(hid, 2, activation='softmax')

sgd = tflearn.SGD(learning_rate=0.04, lr_decay=0.96, decay_step=1000)
net = tflearn.regression(output_layer, optimizer=sgd, loss='categorical_crossentropy')
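
To get a feel for what these optimizer settings do, the sketch below evaluates the standard exponential-decay formula, `learning_rate * lr_decay ** (step / decay_step)`. It assumes continuous (non-staircase) decay, a hypothetical training set of 14,000 reviews, and the batch size and epoch count used in the question's `model.fit` call. The training-set size is an assumption and does not come from the question.

```python
def decayed_lr(step, learning_rate=0.04, lr_decay=0.96, decay_step=1000):
    """Learning rate after `step` updates under continuous exponential decay."""
    return learning_rate * lr_decay ** (step / decay_step)


n_train = 14000                                # assumed size, for illustration only
batch_size = 100                               # as in the model.fit call
n_epoch = 5                                    # as in the model.fit call
steps_per_epoch = -(-n_train // batch_size)    # ceiling division
total_steps = steps_per_epoch * n_epoch

for step in (0, steps_per_epoch, total_steps):
    print(f"step {step:4d}: lr = {decayed_lr(step):.5f}")
```

Under these assumptions the rate drops by less than 3% over the whole run, so the decay settings are unlikely to be what limits the accuracy.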


model = tflearn.DNN(net, tensorboard_verbose=3, tensorboard_dir='tfdir')
try:
    model.fit(X_train, Y_train, n_epoch=5, validation_set=(X_val, Y_val), batch_size=100, show_metric=True, run_id="Imdb")
except KeyboardInterrupt as e:
    print("Stopped by user")


No matter how much I tune the hyperparameters, the training, validation, and test accuracy always tops out at about 0.65.

my_review = "This movie sucks"
my_review_enc = fitter.transform([my_review])
my_review_enc_pad = pad_sequences(my_review_enc.toarray(), maxlen=100, value=0.)
prediction = model.predict(my_review_enc_pad)
prediction


The positive and negative predictions always come out at 50% each.

What am I doing wrong?
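
One possible explanation to rule out (an assumption, not a confirmed diagnosis): if the encoded review ends up as an all-zero vector, a dense network of this shape returns the same output for every such input, because the weights are multiplied by zero and only the biases remain. The plain-Python sketch below mimics a 10-unit tanh layer followed by a 2-way softmax, with random weights and zero biases. All numbers are made up, and a trained network would have non-zero biases, so its constant output need not be exactly 0.5.

```python
import math
import random


def dense(x, weights, biases, activation):
    """Fully connected layer: activation(W @ x + b), written with plain lists."""
    pre = [sum(w * xi for w, xi in zip(w_row, x)) + b
           for w_row, b in zip(weights, biases)]
    return activation(pre)


def tanh(values):
    return [math.tanh(v) for v in values]


def softmax(values):
    peak = max(values)                           # subtract the max for stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
n_in, n_hidden, n_out = 100, 10, 2
w1 = [[random.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
w2 = [[random.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_out)]
b1, b2 = [0.0] * n_hidden, [0.0] * n_out         # zero biases, assumed here


def predict(x):
    hidden = dense(x, w1, b1, tanh)
    return dense(hidden, w2, b2, softmax)


all_zero = [0.0] * n_in                          # a review whose words were all dropped
print(predict(all_zero))
```

A quick way to test this in the original code would be to print `my_review_enc_pad.sum()` before calling `model.predict`; a result of 0 would mean the network never sees any word from the review.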

0 answers:

No answers