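I have a data frame and applied get_dummies to its categorical columns, which generated a new column named "columnName_cellValue" for each category of each categorical column. I then built a model on it and saved the model.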
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load data
df = pd.read_csv('df.csv', sep=',', decimal='.', header=0)
# Encode the target column
df['class'] = LabelEncoder().fit_transform(df['class'])
# Collect the names of the categorical (object-dtype) columns
cat_columns = df.dtypes[df.dtypes == "object"].index
# One-hot encode those columns with get_dummies
df = pd.get_dummies(df, columns=cat_columns, drop_first=True)
I then built a model (say, a random forest) and pickled it.
Later:
I load the model and receive test data consisting of only one record, so each categorical column holds just one of its categories.
To run predict, how should I map the columns of the test data to the column names the model was trained on? At this point I don't have the training data; I have only the model and the test data.
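Example: if the training data has a column "COLOR" with the values red, green, and blue, get_dummies gives 3 columns: COLOR_red, COLOR_green, and COLOR_blue.
Now, if the test data has a "COLOR" column with the value "red", I need to create a COLOR_red column in test_data with the value 1 and set the other two columns to zero. How should I do this for multiple columns that each have multiple categories?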
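One way to express the mapping described above is to encode the single record with get_dummies and then reindex it against the training-time column list. This is only a minimal sketch: it assumes the list of dummy column names was saved alongside the pickled model, and `train_columns`, `SIZE`, `SHAPE`, and the sample values are hypothetical names made up for illustration. Note that with `drop_first=True`, as in the training code, the first category of each column (here `COLOR_blue`) never appears among the training columns.

```python
import pandas as pd

# Hypothetical: dummy column names captured at training time, e.g. saved as
# list(df.drop(columns="class").columns) next to the pickled model.
train_columns = ["SIZE", "COLOR_green", "COLOR_red", "SHAPE_square"]

# A single test record with raw (unencoded) categorical values.
test_df = pd.DataFrame([{"SIZE": 3.5, "COLOR": "red", "SHAPE": "round"}])

# Encode only the categories present in this record. drop_first is NOT used
# here, because it would drop the only category each column contains.
test_encoded = pd.get_dummies(test_df, columns=["COLOR", "SHAPE"], dtype=int)

# Align to the training layout: dummy columns missing from the record are
# filled with 0, columns the model never saw (SHAPE_round) are discarded,
# and the column order matches the training order.
test_aligned = test_encoded.reindex(columns=train_columns, fill_value=0)

print(test_aligned)
```

The aligned frame can then be passed to the loaded model's predict method. If the column list was not saved, it would have to be recovered some other way, which this sketch does not cover.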
I tried using OneHotEncoder:
onehotencoder = OneHotEncoder(categorical_features=cat_columns[0],sparse=False)
df = onehotencoder.fit_transform(df)
I am getting the following error:
C:\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:392: DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
"use the ColumnTransformer instead.", DeprecationWarning)
Traceback (most recent call last):
File "<ipython-input-23-b7547a4fe6b8>", line 1, in <module>
QBE_clean = onehotencoder.fit_transform(df)
File "C:\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 511, in fit_transform
self._handle_deprecations(X)
File "C:\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 396, in _handle_deprecations
sel[np.asarray(self.categorical_features)] = True
IndexError: arrays used as indices must be of integer (or boolean) type
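The last traceback frame shows `categorical_features` being used to index a mask array, so it apparently expects integer positions or a boolean mask rather than a column name such as `cat_columns[0]`. Following the warning's suggestion, here is a minimal sketch of the same encoding with ColumnTransformer. It assumes scikit-learn 0.20 or later, and the toy frames and the `SIZE` column are hypothetical stand-ins for the real data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy training frame standing in for df (target column removed).
train = pd.DataFrame({
    "SIZE": [1.0, 2.0, 3.0],
    "COLOR": ["red", "green", "blue"],
})
cat_columns = ["COLOR"]

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising. sparse_threshold=0 makes the output a dense array.
transformer = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), cat_columns)],
    remainder="passthrough",
    sparse_threshold=0,
)
X_train = transformer.fit_transform(train)

# Later: a single test record is encoded with the categories learned in fit,
# so its columns line up with the training matrix.
test = pd.DataFrame({"SIZE": [3.5], "COLOR": ["red"]})
X_test = transformer.transform(test)

print(X_test)
```

Because the fitted transformer remembers the categories, pickling it together with the model (for example inside one scikit-learn Pipeline) would make the training data unnecessary at prediction time. The model would need to be trained on this encoding rather than on the get_dummies output.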