I am doing text classification with an LSTM model. I reach 98% accuracy on the validation data, but when I submit I get a score of 0. Please help me figure out what to do; I am a beginner in NLP. My data looks like this:
train.head()
   id   category  text
0  959  0         5573 1189 4017 1207 4768 8542 17 1189 5085 5773
1  994  0         6315 7507 6700 4742 1944 2692 3647 4413 6700
2  995  0         5015 8067 5335 1615 7957 5773
3  996  0         2925 7199 1994 4647 7455 5773 4518 2734 2807 8...
4  997  0         7136 1207 6781 237 4971 3669 6193
Here I apply the tokenizer:
from keras.preprocessing.text import Tokenizer
max_features = 1000  # keep only the 1000 most frequent tokens
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(X_train))
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
Here I apply sequence padding:
from keras.preprocessing import sequence
max_words = 30
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)
print(X_train.shape,X_test.shape)
Here is my model:
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense

batch_size = 64
epochs = 5
max_features = 1000
embed_dim = 100
num_classes = train['category'].nunique()
model = Sequential()
model.add(Embedding(max_features, embed_dim, input_length=X_train.shape[1]))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100, dropout=0.2))
model.add(Dense(num_classes, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
Layer (type) Output Shape Param #
=================================================================
embedding_2 (Embedding) (None, 30, 100) 100000
_________________________________________________________________
conv1d_3 (Conv1D) (None, 30, 32) 9632
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 15, 32) 0
_________________________________________________________________
conv1d_4 (Conv1D) (None, 15, 32) 3104
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 7, 32) 0
_________________________________________________________________
lstm_2 (LSTM) (None, 100) 53200
_________________________________________________________________
dense_2 (Dense) (None, 2) 202
=================================================================
Total params: 166,138
Trainable params: 166,138
Non-trainable params: 0
_________________________________________________________________
None
Here are my training epochs:
model_history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=epochs, batch_size=batch_size, verbose=1)
Train on 2771 samples, validate on 693 samples
Epoch 1/5
2771/2771 [==============================] - 2s 619us/step - loss: 0.2816 - acc: 0.9590 - val_loss: 0.1340 - val_acc: 0.9668
Epoch 2/5
2771/2771 [==============================] - 1s 238us/step - loss: 0.1194 - acc: 0.9664 - val_loss: 0.0809 - val_acc: 0.9668
Epoch 3/5
2771/2771 [==============================] - 1s 244us/step - loss: 0.0434 - acc: 0.9843 - val_loss: 0.0258 - val_acc: 0.9899
Epoch 4/5
2771/2771 [==============================] - 1s 236us/step - loss: 0.0150 - acc: 0.9958 - val_loss: 0.0423 - val_acc: 0.9899
Epoch 5/5
2771/2771 [==============================] - 1s 250us/step - loss: 0.0064 - acc: 0.9984 - val_loss: 0.0532 - val_acc: 0.9899
After I apply the predict function to the test data, my submission file looks like this:
submission.head()
id category
0 3729 0.999434
1 3732 0.999128
2 3761 0.999358
3 5 0.996779
4 7 0.998702
The submission file is actually supposed to look like this:
submission.head()
id category
0 3729 1
1 3732 1
2 3761 1
3 5 1
4 7 1
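The mismatch between the two tables above is raw sigmoid probabilities versus the integer class labels the submission expects. A minimal sketch of closing that gap by thresholding, using NumPy and made-up probability values (a real run would use the output of model.predict(X_test)):

```python
import numpy as np

# Hypothetical sigmoid outputs; in practice these come from model.predict(X_test).
probs = np.array([0.999434, 0.999128, 0.003, 0.42])

# With a single sigmoid output, threshold at 0.5 to get integer 0/1 labels.
labels = (probs >= 0.5).astype(int)
print(labels.tolist())  # [1, 1, 0, 0]
```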
Answer 0 (score: 0)
It looks like you need to convert the results back into words! When you tokenized and padded, you turned words into numbers; you just need to change them back. For example:
transformed_category = []
for cat in submission['category']:
    # index_word maps a token id back to its original string
    transformed_category.append(tokenizer.index_word[int(cat)])
For education's sake... the reason you do this is that math can't really be performed on strings, at least not as easily as on numbers. So any time you feed text into a neural network, you need to convert it to a numeric representation before it goes into the network. Vectorizing (what your tokenizer does) and "one-hot"/"categorical" encoding are the most common approaches. In either case, once you pull the results back out of the network, you can turn them back into human-readable words. :)
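As a concrete illustration of the "one-hot"/"categorical" encoding mentioned above, here is a minimal pure-Python sketch (the labels and class count are made up; in Keras the same thing is done by keras.utils.to_categorical):

```python
# One-hot encode integer class labels: each label becomes a vector
# with a single 1 at the label's index and 0 everywhere else.
labels = [0, 2, 1, 0]
num_classes = 3

one_hot = [[1 if i == lab else 0 for i in range(num_classes)] for lab in labels]
print(one_hot)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```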
Hi! So, yes, I was looking at the skewed column. Since sigmoid can only choose values between 0 and 1, you are getting 1 (or values very close to it), and because your loss is binary_crossentropy, the model believes that is what you want. With a sigmoid activation, large values asymptotically approach 1. So I would say you need to rethink your output layer. It looks like you are sending in arrays of numbers and trying to pick out a category in a range larger than 0 to 1, so consider converting your y data to categoricals, using softmax as the final output activation, and changing the loss to categorical_crossentropy.
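Concretely, that change would mean Dense(num_classes, activation='softmax'), loss='categorical_crossentropy', and one-hot encoding y (e.g. with keras.utils.to_categorical). At prediction time you then take the argmax over the per-class probabilities to get an integer label for the submission. A minimal NumPy sketch of that last step, with made-up logits standing in for the model's raw outputs:

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical raw outputs for two samples over two classes.
logits = np.array([[2.0, 0.1],
                   [0.3, 1.5]])
probs = softmax(logits)           # rows sum to 1
preds = probs.argmax(axis=1)      # integer class labels for the submission
print(preds.tolist())  # [0, 1]
```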