In Keras

Time: 2018-05-29 11:45:32

Tags: python-3.x keras

I have a CNN model developed with Keras. I have saved the model to disk, and for prediction purposes I reload it into memory.

##helper libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from subprocess import check_output
from collections import Counter
import gc
from keras.models import model_from_json
import h5py

#keras library
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from sklearn.preprocessing import LabelEncoder
import time
from keras import metrics
print('import done')

# load json and create model
json_file = open('C:\\Users\\user\\Downloads\\model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("C:\\Users\\user\\Downloads\\model.h5")
print("Loaded model from disk")

I have a test dataset on which I make predictions to compute accuracy. This is done after preprocessing, as shown below:

#fetch data
data = pd.read_csv('C:\\Users\\user\\Downloads\\model\\data.csv')
texts = data.texts.tolist()
# labels
le = LabelEncoder()
tags = le.fit_transform(data.tags.tolist())

#preprocess
num_max = 1000
tok = Tokenizer(num_words=num_max)
tok.fit_on_texts(texts)
mat_texts = tok.texts_to_matrix(texts,mode='count')
print(tags[:5])
print(mat_texts[:5])
print(tags.shape,mat_texts.shape)

[1 1 0 1 1]
[[0. 2. 2. ... 0. 0. 0.]
 [0. 9. 3. ... 0. 0. 0.]
 [0. 2. 2. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 [0. 2. 4. ... 0. 0. 0.]]
(400000,) (400000, 1000)

More preprocessing code:

# for cnn preproces
max_len = 100
cnn_texts_seq = tok.texts_to_sequences(texts)
print(cnn_texts_seq[0])
cnn_texts_mat = sequence.pad_sequences(cnn_texts_seq,maxlen=max_len)
print(cnn_texts_mat[0])
print(cnn_texts_mat.shape)

[23, 16, 31, 94, 21, 45, 26, 7, 1, 31, 7, 79, 3, 22, 5, 8, 94, 11, 137, 2, 3, 127, 81, 6, 52, 110, 10, 4, 33, 6, 210, 44, 233, 91, 4, 128, 38, 34, 10, 1, 8, 94, 38, 154, 25, 2, 651, 38, 26, 7, 8, 9, 4, 94, 10, 21, 20, 180, 97, 124, 129, 6, 224, 9, 38, 871, 44, 3, 239, 8, 53, 619, 425, 581, 467, 134, 512, 26, 163, 72, 13, 12, 925]
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  23
  16  31  94  21  45  26   7   1  31   7  79   3  22   5   8  94  11 137
   2   3 127  81   6  52 110  10   4  33   6 210  44 233  91   4 128  38
  34  10   1   8  94  38 154  25   2 651  38  26   7   8   9   4  94  10
  21  20 180  97 124 129   6 224   9  38 871  44   3 239   8  53 619 425
 581 467 134 512  26 163  72  13  12 925]
(400000, 100)

When I finally run the prediction on the processed features:

predicted_values = np.round(loaded_model.predict(cnn_texts_mat))

If I print the first 10 predicted values, I get the following:

print(predicted_values[0:10])
[[1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]]

So far so good. However, the challenge is that I am trying to predict on a single document.

Here is a glimpse of the first 8 documents:

0         Great CD: My lovely Pat has one of ...
1         One of the best game music soundtra...
2         Batteries died within a year ...: I bought th...
3         works fine, but Maha Energy is bett...
4         Great for the non-audiophile: Revie...
5         DVD Player crapped out after one year: I also...
6         Incorrect Disc: I love the style of this, but...
7         DVD menu select problems: I cannot scroll thr...

I want to predict only the first document:

single_text = '''Great CD: My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I\'m in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life\'s hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"'''

When I pass the single document, I get an array of a different size:

num_max = 1000
tok = Tokenizer(num_words=num_max)
tok.fit_on_texts(single_text)
mat_texts = tok.texts_to_matrix(single_text,mode='count')
print(tags[0])
print(mat_texts)
print(tags[0].shape,mat_texts.shape)

1
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
() (533, 1000)

Question 1) The last value should be (1, 1000) instead of (533, 1000). Not sure why this happens.
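For what it's worth, the 533 can be traced to the length of the string itself: Keras's Tokenizer methods expect a list of texts, and iterating a bare Python string yields one character at a time, so each of the 533 characters is treated as a separate document. A minimal sketch of the difference, reusing the tok fitted on the full texts list earlier:

print(len(single_text))   # 533 -- hence one matrix row per character above

# wrapping the string in a one-element list makes it a single document
mat_single = tok.texts_to_matrix([single_text], mode='count')
print(mat_single.shape)   # (1, 1000)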

Proceeding further with the preprocessing:

# for cnn preproces
max_len = 100
cnn_texts_seq = tok.texts_to_sequences(single_text)
print(cnn_texts_seq[0])
cnn_texts_mat = sequence.pad_sequences(cnn_texts_seq,maxlen=max_len)
print(cnn_texts_mat[0])
print(cnn_texts_mat.shape)

I get:

[14]
[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0 14]
(533, 100)

2) The size of this array is also wrong: instead of (1, 100) it is (533, 100).

3) At prediction time, as expected given the above, I get an array of size (533, 1) instead of just one value. Not sure why:

predicted_values = loaded_model.predict(cnn_texts_mat)
print(predicted_values.shape)
(533, 1)
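(For reference: with a single-document batch of shape (1, 100), predict returns an array of shape (1, 1), and the scalar can be read out directly. A hypothetical sketch, where single_mat stands for a correctly preprocessed one-row input:

single_pred = loaded_model.predict(single_mat)  # single_mat: shape (1, 100)
print(float(single_pred[0, 0]))                 # the one predicted probability

The answer below explains how to build such an input.)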

Regardless of everything I wrote above, how can I simply get the predicted value for one document instead of such a huge array?

EDIT: I tried using a list object:

single_text = ['''Great for the non-audiophile: Reviewed quite a bit of the combo players and was hesitant due to unfavorable reviews and size of machines. I am weaning off my VHS collection, but don't want to replace them with DVD's. This unit is well built, easy to setup and resolution and special effects (no progressive scan for HDTV owners) suitable for many people looking for a versatile product.Cons- No universal remote.''']

def pre_process(single_text,max_len = 100,num_max = 1000):
    tok = Tokenizer(num_words=num_max)
    tok.fit_on_texts(single_text)
    mat_texts = tok.texts_to_matrix(single_text,mode='count')
    cnn_texts_seq = tok.texts_to_sequences(single_text)
    cnn_texts_mat = sequence.pad_sequences(cnn_texts_seq,maxlen=max_len)
    return cnn_texts_mat

print(np.round(loaded_model.predict(pre_process([texts[i]]))))
1

This one matches, but many other values do not match the values from the original prediction run. Any help?

1 Answer:

Answer 0 (score: 1):

Since the working part of the code used lists of texts, all predictions should use lists as well, even if those lists contain only one element. (In a Keras model, this corresponds to an input array whose first dimension is 1; the first dimension is the batch size.)

So, I suggest you put single_text into a list:

single_text = ['''Great CD: My lovely <removed for visibility> was that singing ?"''']

Now, I also think you should not fit the tokenizer again.

It should recognize all the words exactly the same way it did during training. If you fit it again, it will start producing different tokens, and your predictions will naturally be completely different.
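A quick toy illustration of that effect (a hypothetical three-line corpus, not from the question):

corpus = ['the cat sat', 'the dog ran', 'a cat ran']

tok_full = Tokenizer()
tok_full.fit_on_texts(corpus)        # fitted on everything, as in training

tok_one = Tokenizer()
tok_one.fit_on_texts([corpus[2]])    # re-fitted on a single document

print(tok_full.word_index)           # index map built from the whole corpus
print(tok_one.word_index)            # a different, incompatible index map

The same sentence therefore maps to different integer sequences under the two tokenizers, and the model receives inputs unlike anything it was trained on.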

You should remove the repeated lines:

tok = Tokenizer(num_words=num_max)
tok.fit_on_texts(single_text)

And use the original tokenizer (it may be a good idea to save it along with the model, if possible).
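A minimal sketch of that workflow, assuming the tokenizer fitted on the training texts is still in scope as tok. The file name is just an example; pickle is one common way to persist a Keras Tokenizer, since it is not stored inside model.h5:

import pickle

# at training time: save the fitted tokenizer next to the model
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(tok, f)

# at prediction time: reload it instead of fitting a new one
with open('tokenizer.pkl', 'rb') as f:
    tok = pickle.load(f)

def pre_process(texts, tok, max_len=100):
    # reuse the already-fitted tokenizer -- no fit_on_texts here
    seqs = tok.texts_to_sequences(texts)
    return sequence.pad_sequences(seqs, maxlen=max_len)

pred = loaded_model.predict(pre_process(single_text, tok))  # single_text is a list
print(np.round(pred))   # shape (1, 1): one prediction for the one document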