Feature count mismatch with OneHotEncoder when predicting a single data instance

Date: 2018-03-15 11:41:37

Tags: machine-learning scikit-learn random-forest

How can OneHotEncoder be used when predicting a single value?

  

Error message: ValueError: Number of features of the model must match the input. Model n_features is 1261 and input n_features is 16

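I am training a random forest classifier on text data. For each instance of this text data I compute 16 features. Since all 16 variables are categorical, I use OneHotEncoder to encode each of them, which results in a training matrix with 1261 columns. I also applied feature scaling to these columns. I then did an 80:20 train:test split and applied the predictor to obtain the confusion matrix and classification report. Finally, I saved the classifier, the standard scaler object, and the OneHotEncoder object to local disk in pickle format.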
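To illustrate where a column count like 1261 comes from, here is a minimal sketch on made-up data (the column names and values are hypothetical, and only three features are used instead of sixteen). It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string categories directly and exposes the learned categories as `categories_`. The encoded width is the total number of distinct categories seen during fitting:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy training data: three categorical features.
train = pd.DataFrame({
    "pos": ["noun", "verb", "adj", "noun"],
    "lang": ["en", "de", "en", "fr"],
    "case": ["lower", "upper", "lower", "lower"],
})

encoder = OneHotEncoder(handle_unknown="ignore")
X_train = encoder.fit_transform(train).toarray()

# One output column per distinct category learned during fit:
# 3 (pos) + 3 (lang) + 2 (case) = 8 columns.
n_categories = sum(len(c) for c in encoder.categories_)
print(X_train.shape, n_categories)  # (4, 8) 8
```

The classifier is trained on this expanded matrix, so anything passed to predict() later has to be expanded with the same fitted encoder.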

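Now I want to create a (REST) service for the predictor in a new, separate file. This API will load the saved models in .pkl format and make a prediction for a single new text value, essentially returning its predicted class name and the corresponding confidence score.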

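The problem I am facing is this: when I encode this single text value, I get a vector with 16 features. It is not encoded into 1261 features. As a result, when I run the classifier's predict() function on the new text, it gives me the following error: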

  

    % (self.n_features_, n_features))
ValueError: Number of features of the model must match the input. Model n_features is 1261 and input n_features is 16

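How can I use the deserialized pkl model to predict a single instance when the encoded matrix does not match the size the classifier was trained on? How can this be resolved?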

Edit: posting the code snippet and the exception stack:

import pickle

# Loading the .pkl files used in training
with open('model.pkl', 'rb') as f_model:
    classifier = pickle.load(f_model) # trained classifier model

with open('labelencoder_file.pkl', 'rb') as f_lblenc:
    label_encoder = pickle.load(f_lblenc) # label encoder object used in training

with open('encoder_file.pkl', 'rb') as f_onehotenc:
    onehotencoder = pickle.load(f_onehotenc) # onehotencoder object used in training

with open('sc_file.pkl', 'rb') as f_sc:
    scaler = pickle.load(f_sc) # standard scaler object used in training

X = df_features # df_features is the dataframe containing the computed feature values. It has 16 columns as 16 features have been computed for the new value
X.values[:, 0] = label_encoder.fit_transform(X.values[:, 0])
X.values[:, 1] = label_encoder.fit_transform(X.values[:, 1])
# This is repeated  till X.values[:, 15] as all features are categorical

X = onehotencoder.fit_transform(X).toarray()
X = scaler.fit_transform(X)
print(X.shape) # This prints (1, 16), thus showing that encoding has not worked properly

y_pred = classifier.predict(X) # This throws the exception

Traceback (most recent call last):
  File "/home/Test/api.py", line 256, in api_func()
    y_pred = classifier.predict(X)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 538, in predict
    proba = self.predict_proba(X)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 578, in predict_proba
    X = self._validate_X_predict(X)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 357, in _validate_X_predict
    return self.estimators_[0]._validate_X_predict(X, check_input=True)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/tree/tree.py", line 384, in _validate_X_predict
    % (self.n_features_, n_features))
ValueError: Number of features of the model must match the input. Model n_features is 1261 and input n_features is 16
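The (1, 16) shape printed in the snippet is what one would expect if the encoder is re-fitted on the single new row: calling fit_transform() at prediction time discards the categories learned during training, and a single row has exactly one category per feature. A small sketch on hypothetical toy data (two features rather than sixteen, scikit-learn 0.20 or later assumed) shows the difference between transform() and fit_transform():

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: 2 categorical features with 3 and 2 distinct values.
X_train = np.array(
    [["red", "small"], ["green", "large"], ["blue", "small"]], dtype=object
)
x_new = np.array([["green", "small"]], dtype=object)

encoder = OneHotEncoder()
encoder.fit(X_train)

# Reusing the fitted encoder keeps the training-time width (3 + 2 = 5 columns).
ok = encoder.transform(x_new).toarray()

# Re-fitting on the single row learns one category per feature, so the
# output has only as many columns as there are features (here 2).
bad = OneHotEncoder().fit_transform(x_new).toarray()

print(ok.shape)   # (1, 5)
print(bad.shape)  # (1, 2)
```

The same reasoning applies to the LabelEncoder and StandardScaler objects: at prediction time they would be used with transform(), not fit_transform().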

1 answer:

Answer 0 (score: 0)

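Posting the modified code that solved the problem. The key change is calling transform() on the loaded encoders instead of fit_transform(), with a separate LabelEncoder persisted for each categorical variable: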

import pickle

# Loading the .pkl files that were persisted during training
with open('model.pkl', 'rb') as f_model:
    classifier = pickle.load(f_model) # trained classifier model

with open('labelencoder00.pkl', 'rb') as f_lblenc00:
    label_encoder00 = pickle.load(f_lblenc00) # LabelEncoder() object that was used for encoding the first categorical variable
with open('labelencoder01.pkl', 'rb') as f_lblenc01:
    label_encoder01 = pickle.load(f_lblenc01) # LabelEncoder() object that was used for encoding the second categorical variable

with open('onehotencoder.pkl', 'rb') as f_onehotenc:
    onehotencoder = pickle.load(f_onehotenc) # OneHotEncoder object that was used in training


X = df_features # df_features is the dataframe containing the computed feature values
X.values[:, 0] = label_encoder00.transform(X.values[:, 0])
X.values[:, 1] = label_encoder01.transform(X.values[:, 1])

X = onehotencoder.transform(X).toarray()

pred = classifier.predict(X)
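As an alternative to persisting each encoder separately, the preprocessing and the classifier can be bundled into a single scikit-learn Pipeline and pickled as one object. This is only a sketch on hypothetical toy data, not the code used above; it assumes scikit-learn 0.20 or later so that OneHotEncoder can take string categories without a LabelEncoder step. Setting handle_unknown="ignore" encodes categories unseen during training as all zeros instead of raising an error.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy training data: 2 categorical features, 2 classes.
X_train = np.array([
    ["red", "small"], ["green", "large"], ["blue", "small"],
    ["red", "large"], ["green", "small"], ["blue", "large"],
], dtype=object)
y_train = np.array(["a", "b", "a", "b", "a", "b"])

# Encoder and classifier are fitted together and travel as one object.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(X_train, y_train)

blob = pickle.dumps(model)    # training side (a file would work the same way)
served = pickle.loads(blob)   # serving side, e.g. inside the REST handler

# A single new instance with the raw (unencoded) feature values.
x_new = np.array([["green", "small"]], dtype=object)
label = served.predict(x_new)[0]
proba = served.predict_proba(x_new)[0]
confidence = proba.max()      # probability of the predicted class
print(label, confidence)
```

With this layout the service only ever calls predict() or predict_proba() on raw feature rows, so there is no separate encoding step that could be re-fitted by mistake.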