我正在尝试使用文本和IMDB随机评论的分数进行情感分析预测。我把所有的单词都变成了“单词袋”,并将它们全部放入神经网络。但是,该预测似乎并不正确,对于我输入的任何评论,它总是显示出50%的阳性预测和50%的阴性预测。
reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_)
print(type(reviews))
print(reviews.head())
print(labels.head())
#Split into train/test
x_train, x_test, y_train, y_test = train_test_split(reviews,Y)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train)
#min_df = 19 seems to be the first number that fills all 10 000 entries - thus the 10 most commonly used words
vect = CountVectorizer(min_df=19, max_features=10000)
fitter = vect.fit(x_train[0])
X_train = fitter.transform(x_train[0])
X_test = fitter.transform(x_test[0])
X_val = fitter.transform(x_val[0])
print("Vocabulary size: {}".format(len(vect.vocabulary_)))
feature_names = vect.get_feature_names()
print("Number of features: {}".format(len(feature_names)))
print("Vocabulary content:\n {}".format(fitter.vocabulary_))
X_train = pad_sequences(X_train.toarray(), maxlen=100, value=0.)
X_test = pad_sequences(X_test.toarray(), maxlen=100, value=0.)
X_val = pad_sequences(X_val.toarray(), maxlen=100, value=0.)
Y_train = to_categorical(y_train, 2)
Y_test = to_categorical(y_test, 2)
Y_val = to_categorical(y_val, 2)
tensorflow.reset_default_graph()
input_layer = tflearn.input_data(shape=[None, 100])
net = tflearn.embedding(input_layer, input_dim=10000, output_dim=128)
hid = tflearn.fully_connected(input_layer, 10, activation='tanh') # a hidden layer with 10 neurons
output_layer = tflearn.fully_connected(hid, 2, activation='softmax')
sgd = tflearn.SGD(learning_rate=0.04, lr_decay=0.96, decay_step=1000)
net = tflearn.regression(output_layer, optimizer=sgd, loss='categorical_crossentropy')
model = tflearn.DNN(net, tensorboard_verbose=3, tensorboard_dir='tfdir')
try:
model.fit(X_train, Y_train, n_epoch=5, validation_set=(X_val, Y_val), batch_size=100, show_metric=True, run_id="Imdb")
except KeyboardInterrupt as e:
print("Stopped by user")
无论我调整超参数有多少,训练,验证和测试的准确性始终最大约为0.65。
my_review = "This movie sucks"
my_review_enc = fitter.transform([my_review])
my_review_enc_pad = pad_sequences(my_review_enc.toarray(), maxlen=100, value=0.)
prediction = model.predict(my_review_enc_pad)
prediction
如您所见,正面和负面的预测总是在50%
我在做什么错了?