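After training a model in Keras for the toxic comment challenge, the accuracy of its predictions is poor. I'm not sure what I'm doing wrong, because accuracy during training was very good, around 0.98.
This is how I train: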
import sys, os, re, csv, codecs, numpy as np, pandas as pd
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
train = pd.read_csv('train.csv')
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_train = train["comment_text"]
max_features = 20000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
inp = Input(shape=(maxlen, ))
embed_size = 128
x = Embedding(max_features, embed_size)(inp)
x = LSTM(60, return_sequences=True,name='lstm_layer')(x)
x = GlobalMaxPool1D()(x)
x = Dropout(0.1)(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
batch_size = 32
epochs = 2
print(X_t[0])
model.fit(X_t,y, batch_size=batch_size, epochs=epochs, validation_split=0.1)
model.save("m.hdf5")
This is how I predict:
from keras.models import load_model
model = load_model('m.hdf5')
list_sentences_train = np.array(["I love you Stackoverflow"])
max_features = 20000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
print(X_t)
print(model.predict(X_t))
Output:
[[1.97086316e-02 9.36032447e-05 3.93966911e-03 5.16672269e-04 3.67353857e-03 1.28102733e-03]]
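Answer 0 (score: 1)
In the inference (i.e. prediction) phase, you should apply the same preprocessing steps that you used while training the model. Therefore, you should not create a new Tokenizer instance and fit it on your test data: a freshly fitted tokenizer assigns word indices that have nothing to do with the ones the model's Embedding layer learned. Instead, if you want to use the same model for prediction later, then in addition to the model you must also save everything that was derived from the training data, such as the vocabulary held in the Tokenizer instance. So it would look like this: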
import pickle
# building and training of the model as you have done ...
# store all the data we need later: model and tokenizer
model.save("m.hdf5")
with open('tokenizer.pkl', 'wb') as handler:
    pickle.dump(tokenizer, handler)
Now, in the prediction phase:
import pickle
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
model = load_model('m.hdf5')
with open('tokenizer.pkl', 'rb') as handler:
    tokenizer = pickle.load(handler)
list_sentences_train = ["I love you Stackoverflow"]
# use the same tokenizer instance you used in the training phase
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
print(model.predict(X_t))
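To see why refitting matters, here is a minimal sketch in plain Python. It does not use Keras: fit_word_index and texts_to_sequences below are hypothetical toy stand-ins that only imitate the general idea of Tokenizer (index words by descending frequency, starting at 1, and drop unknown words), and the three training sentences are made up for illustration.

```python
from collections import Counter


def fit_word_index(texts):
    """Toy stand-in for fitting a tokenizer (an assumption, not the Keras code).

    Words are ranked by descending frequency and indexed from 1.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count; ties keep their first-seen order
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(word_index, texts):
    """Map every known word to its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


# Made-up data, for illustration only
train_texts = ["you are awful", "you are great", "i love you"]
test_texts = ["i love you"]

train_index = fit_word_index(train_texts)   # vocabulary the model was trained with
refit_index = fit_word_index(test_texts)    # vocabulary from refitting on test data

seq_with_train_vocab = texts_to_sequences(train_index, test_texts)
seq_with_refit_vocab = texts_to_sequences(refit_index, test_texts)

print(seq_with_train_vocab)
print(seq_with_refit_vocab)
```

The same sentence is encoded as two different index sequences. With the refitted vocabulary, the words get the lowest indices, which in the training vocabulary belong to the most frequent training words, so the model effectively receives a different sentence from the one you typed.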