Poor accuracy when making predictions

Asked: 2019-05-26 12:11:29

Tags: python tensorflow machine-learning keras nlp

After training a model on the toxic comment challenge in Keras, the prediction accuracy is very poor. I'm not sure what I'm doing wrong, since the accuracy during training was very good, around 0.98.

How I train the model:

import sys, os, re, csv, codecs, numpy as np, pandas as pd
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

train = pd.read_csv('train.csv')


list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_train = train["comment_text"]

max_features = 20000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)

maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)

inp = Input(shape=(maxlen, ))

embed_size = 128
x = Embedding(max_features, embed_size)(inp)
x = LSTM(60, return_sequences=True,name='lstm_layer')(x)
x = GlobalMaxPool1D()(x)
x = Dropout(0.1)(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)

model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

batch_size = 32
epochs = 2
print(X_t[0])
model.fit(X_t,y, batch_size=batch_size, epochs=epochs, validation_split=0.1)

model.save("m.hdf5")

And this is how I predict:

import numpy as np
from keras.models import load_model
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

model = load_model('m.hdf5')

list_sentences_train = np.array(["I love you Stackoverflow"])

max_features = 20000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)

maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)

print(X_t)

print(model.predict(X_t))

Output:


[[1.97086316e-02 9.36032447e-05 3.93966911e-03 5.16672269e-04 3.67353857e-03 1.28102733e-03]]

1 Answer:

Answer 0 (score: 1):

At inference (i.e. prediction) time, you should use the same preprocessing steps you applied during training. Therefore, you should not create a new Tokenizer instance and fit it on your test data. Instead, if you want to make predictions with the same model later, you must save, in addition to the model itself, all the statistics obtained from the training data, such as the vocabulary stored in the Tokenizer instance. So it would look like this:

import pickle

# building and training of the model as you have done ...

# store all the data we need later: model and tokenizer    
model.save("m.hdf5")
with open('tokenizer.pkl', 'wb') as handler:
    pickle.dump(tokenizer, handler)
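
To see concretely why fitting a new Tokenizer on the test data breaks predictions, here is a minimal sketch (the toy sentences are made up for illustration and are not from the question's data): the same sentence is mapped to completely different indices depending on which texts the tokenizer was fitted on.

from keras.preprocessing.text import Tokenizer

# tokenizer fitted on (toy) training sentences
train_tokenizer = Tokenizer(num_words=20000)
train_tokenizer.fit_on_texts(["you are toxic", "I love you"])

# a fresh tokenizer fitted only on the test sentence
test_tokenizer = Tokenizer(num_words=20000)
test_tokenizer.fit_on_texts(["I love you Stackoverflow"])

sentence = ["I love you Stackoverflow"]
print(train_tokenizer.texts_to_sequences(sentence))  # e.g. [[4, 5, 1]] -- "Stackoverflow" is out of vocabulary and dropped
print(test_tokenizer.texts_to_sequences(sentence))   # e.g. [[1, 2, 3, 4]] -- same words, entirely different indices

Depending on your Keras version, the Tokenizer may also offer to_json() together with keras.preprocessing.text.tokenizer_from_json() as a pickle-free way to persist it; check that your installed version supports this before relying on it.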

Now, in the prediction phase:

import pickle

from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

model = load_model('m.hdf5')
with open('tokenizer.pkl', 'rb') as handler:
    tokenizer = pickle.load(handler)

list_sentences_train = ["I love you Stackoverflow"]

# use the same Tokenizer instance that was used in the training phase
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)

print(model.predict(X_t))
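
The six probabilities returned by predict are in the same order as the list_classes used during training. As a follow-up sketch (the 0.5 decision threshold is my assumption, not part of the original answer), you can map them back to the label names:

# map the six sigmoid outputs back to the class names used in training;
# 0.5 is an assumed decision threshold, tune it for your use case
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

probs = model.predict(X_t)[0]
for label, p in zip(list_classes, probs):
    print(f"{label}: {p:.4f} -> {'yes' if p >= 0.5 else 'no'}")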