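Making predictions from a trained LSTM model

Time: 2019-12-22 13:23:14

Tags: python tensorflow machine-learning keras lstm

I have trained a model using an LSTM on some data I collected. I want to classify a text as canine or feline.

I am trying to predict on a text string like this: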

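import numpy as np
from keras.models import model_from_json
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# load the model architecture from JSON
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)

# load weights into new model
loaded_model.load_weights("lstm.hd5")
print("Loaded model from disk")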


text_to_predict = ['A 2‐year‐old male domestic shorthair cat was presented for a progressive history of abnormal posture, behavior, and mentation. Menace response was absent bilaterally, and generalized tremors were identified on neurological examination. A neuroanatomical diagnosis of diffuse brain dysfunction was made. A neurodegenerative disorder was suspected. Magnetic resonance imaging findings further supported the clinical suspicion. Whole‐genome sequencing of the affected cat with filtering of variants against a database of unaffected cats was performed. Candidate variants were confirmed by Sanger sequencing followed by genotyping of a control population. Two homozygous private (unique to individual or families and therefore absent from the breed‐matched controlled population) protein‐changing variants in the major facilitator superfamily domain 8 (MFSD8) gene, a known candidate gene for neuronal ceroid lipofuscinosis type 7 (CLN7), were identified. The affected cat was homozygous for the alternative allele at both variants. This is the first report of a pathogenic alteration of the MFSD8 gene in a cat strongly suspected to have CLN7.']




MAX_SEQUENCE_LENGTH = 352
MAX_NB_WORDS = 2000

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, split=' ')
seq = tokenizer.texts_to_sequences(text_to_predict)
padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
pred = loaded_model.predict(padded)
labels = ['canine', 'feline']
print(pred, labels[np.argmax(pred)])

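However, no matter which string I choose to classify, every prediction returns the same result.

[[0.5212073 0.47879276]] canine

I am also not sure why I have to set MAX_SEQUENCE_LENGTH to 352; my model seems to expect an array of that size. Setting it to any other value returns an error: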

ValueError: Error when checking input: expected embedding_1_input to have shape (352,) but got array with shape (250,)

For reference, my model was trained with this code.

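import re
import pandas as pd
from keras.models import Sequential
from keras.layers import Embedding, SpatialDropout1D, LSTM, Dense
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

data = pd.read_csv('data.csv')
data['Text'] = data['Text'].apply((lambda x: re.sub('[^a-zA-z0-9\s]','',x)))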

MAX_NB_WORDS = 2000
embed_dim = 128
lstm_out = 196

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, split=' ')
tokenizer.fit_on_texts(data['Text'].values)
X = tokenizer.texts_to_sequences(data['Text'].values)
X = pad_sequences(X)


model = Sequential()
model.add(Embedding(MAX_NB_WORDS, embed_dim,input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

# serialize model to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)

print('model string has been saved')

Y =  data[['canine','feline']]
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.33, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

batch_size = 32
model.fit(X_train, Y_train, epochs = 30, batch_size=batch_size, verbose = 2)

#save model for future use.
model.save('lstm.hd5')

Any help would be greatly appreciated :D

1 answer:

Answer 0: (score: 0)

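From your question, I understand that the predictions are correct right after training the model, but the predicted class is always the same after loading the saved model.

I recently ran into the same problem. What resolved it was saving the Tokenizer fitted during training to a pickle file, and loading that pickle file whenever we want to make predictions after loading the saved model.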
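
To see why a tokenizer that was never fitted leads to identical predictions, here is a minimal pure-Python sketch. The helpers fit_vocab and to_padded are hypothetical stand-ins written for illustration, not the Keras implementation; they only mimic the relevant behaviour of Tokenizer and pad_sequences (unknown words are skipped, short sequences are left-padded with zeros). With an empty vocabulary every text becomes the same all-zero row, so the model receives the same input each time.

```python
from collections import Counter


def fit_vocab(texts):
    """Build a word -> integer id mapping; id 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}


def to_padded(texts, vocab, maxlen):
    """Map words to ids (skipping unknown words), then left-pad with zeros."""
    rows = []
    for text in texts:
        ids = [vocab[word] for word in text.lower().split() if word in vocab]
        ids = ids[-maxlen:]  # keep the last maxlen ids if the text is too long
        rows.append([0] * (maxlen - len(ids)) + ids)
    return rows


texts = ["the cat purred", "the dog barked"]

# Empty vocabulary, as with a Tokenizer on which fit_on_texts was never called:
# every row is all zeros, so the two texts are indistinguishable.
unfitted_rows = to_padded(texts, vocab={}, maxlen=4)
print(unfitted_rows)

# Vocabulary fitted on the texts: the rows now differ.
fitted_rows = to_padded(texts, vocab=fit_vocab(texts), maxlen=4)
print(fitted_rows)
```

In the question, Tokenizer(num_words=MAX_NB_WORDS, split=' ') is created again at prediction time without calling fit_on_texts, which corresponds to the empty-vocabulary case here.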

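Code to save the Tokenizer in a pickle file:

import pickle

# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

Code to load the pickle file:

with open('tokenizer.pickle', 'rb') as handle:
    tokenizer2 = pickle.load(handle)

Besides the code above, a few other observations about your code:

  1. It is recommended to use the same padding when training the model and when making predictions with the loaded model. So you can change the code from X = pad_sequences(X) to X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH).

  2. The values of MAX_SEQUENCE_LENGTH and MAX_NB_WORDS should be the same before and after loading the model.

  3. It is recommended to apply the same data preprocessing steps before and after loading the model. So you can also apply the function (lambda x: re.sub('[^a-zA-z0-9\s]','',x)) after loading the model.

With these changes, both the training code and the prediction code for the loaded model should work correctly.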
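
A minimal sketch of what the prediction side could look like with these points applied. It assumes the Keras 2.x import paths used around the time of the question, that tokenizer.pickle was written by the training script as described above, and the file names model.json and lstm.hd5 from the question. The names clean_text and predict_label are hypothetical helpers introduced here for illustration. The sequence length is read from the loaded model rather than hard-coded: 352 is simply the length of the longest tokenized training text, because pad_sequences(X) without maxlen pads to the longest sequence and the Embedding layer was built with input_length = X.shape[1].

```python
import pickle
import re


def clean_text(text):
    # Same character filter the training script applied to data['Text'].
    return re.sub(r'[^a-zA-z0-9\s]', '', text)


def predict_label(text, model_json_path='model.json', weights_path='lstm.hd5',
                  tokenizer_path='tokenizer.pickle', labels=('canine', 'feline')):
    # Assumption: Keras 2.x import paths; deferred so clean_text stays lightweight.
    import numpy as np
    from keras.models import model_from_json
    from keras.preprocessing.sequence import pad_sequences

    # Rebuild the architecture and load the trained weights.
    with open(model_json_path) as json_file:
        model = model_from_json(json_file.read())
    model.load_weights(weights_path)

    # Reuse the tokenizer fitted during training; do not create a new one.
    with open(tokenizer_path, 'rb') as handle:
        tokenizer = pickle.load(handle)

    # Take the expected sequence length from the model instead of hard-coding it.
    maxlen = model.input_shape[1]

    seq = tokenizer.texts_to_sequences([clean_text(text)])
    padded = pad_sequences(seq, maxlen=maxlen)
    pred = model.predict(padded)
    return labels[int(np.argmax(pred))], pred
```

It could then be called as label, pred = predict_label(text_to_predict[0]), using the text_to_predict list from the question.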

If these changes do not give you the desired output, please let us know and we will be happy to help.

Hope this helps. Happy learning!