SKLEARN // Combining GridSearchCV with ColumnTransformer and Pipeline

Time: 2020-06-11 19:03:22

Tags: scikit-learn pipeline gridsearchcv

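I am struggling with a machine learning project in which I am trying to combine:

  • a sklearn ColumnTransformer that applies different transformers to my numerical and categorical features
  • a pipeline that applies my different transformers and an estimator
  • a GridSearchCV that searches for the best parameters.

As long as I fill in the parameters of the different transformers manually in the pipeline, the code works fine. But as soon as I pass lists of values to compare in my grid search parameters, I get all kinds of "invalid parameter" error messages.

Here is my code:

First, I split my features into numerical and categorical: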

import numpy as np
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder


numerical_features=make_column_selector(dtype_include=np.number)
cat_features=make_column_selector(dtype_exclude=np.number)

Then I create two separate preprocessing pipelines, one for the numerical features and one for the categorical features:

numerical_pipeline= make_pipeline(KNNImputer())
cat_pipeline=make_pipeline(SimpleImputer(strategy='most_frequent'),OneHotEncoder(handle_unknown='ignore'))

I combine both into another pipeline, set the parameters, and run my GridSearchCV code:

model=make_pipeline(preprocessor, LinearRegression() )

params={
    'columntransformer__numerical_pipeline__knnimputer__n_neighbors':[1,2,3,4,5,6,7]
}

grid=GridSearchCV(model, param_grid=params,scoring = 'r2',cv=10)
cv = KFold(n_splits=5)
all_accuracies = cross_val_score(grid, X, y, cv=cv,scoring='r2')

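I have tried different ways of declaring the parameters but never found the right one. I always get an "invalid parameter" error message.

Could you please help me understand what went wrong?

Thank you very much for your support, and take care!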

1 Answer:

Answer 0: (score: 1)

I am assuming that you have defined preprocessor as follows:

preprocessor = Pipeline([('numerical_pipeline',numerical_pipeline),
                        ('cat_pipeline', cat_pipeline)])

Then you need to change the parameter name to:

pipeline__numerical_pipeline__knnimputer__n_neighbors
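
One way to check which names GridSearchCV will accept is to list the keys returned by get_params() on the final model. The sketch below rebuilds the pipeline with the preprocessor assumed in this answer (the question does not show it) and filters for the KNNImputer setting. The variable names valid_names and knn_names are only illustrative.

```python
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder

numerical_pipeline = make_pipeline(KNNImputer())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore'))

# Assumed definition of preprocessor (not shown in the question).
preprocessor = Pipeline([('numerical_pipeline', numerical_pipeline),
                         ('cat_pipeline', cat_pipeline)])
model = make_pipeline(preprocessor, LinearRegression())

# Every key returned by get_params() is a valid key for param_grid.
valid_names = sorted(model.get_params().keys())
knn_names = [name for name in valid_names if 'n_neighbors' in name]
print(knn_names)
```

Each step name is joined to the next with a double underscore, so the prefix depends on how the outer pipeline was built: make_pipeline names a step after its lowercased class, which is why a nested Pipeline shows up as pipeline.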

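However, there are a couple of other problems with the code:

  1. You do not need to call cross_val_score after running GridSearchCV. The output of GridSearchCV itself holds the cross-validation results for each hyperparameter combination.

  2. KNNImputer will not work when the data contains strings. You need to apply cat_pipeline before numerical_pipeline.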
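
To illustrate the first point, here is a minimal sketch of reading the per-candidate scores straight from a fitted GridSearchCV. The data is synthetic and purely numeric, invented for this example, and the pipeline is reduced to KNNImputer plus LinearRegression so that the step name is simply knnimputer.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic numeric data (an assumption for illustration only).
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)
X[::7, 1] = np.nan  # knock out a few values so the imputer has work to do

model = make_pipeline(KNNImputer(), LinearRegression())
params = {'knnimputer__n_neighbors': [1, 2, 3]}
grid = GridSearchCV(model, param_grid=params, scoring='r2', cv=5)
grid.fit(X, y)

# One mean cross-validated score per candidate; no cross_val_score call needed.
for candidate, score in zip(grid.cv_results_['params'],
                            grid.cv_results_['mean_test_score']):
    print(candidate, round(score, 3))
print(grid.best_params_)
```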

Full example:

import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: one categorical column with a missing value, one numeric column.
X = pd.DataFrame({'city': ['London', 'London', 'Paris', np.nan],
                  'rating': [5, 3, 4, 5]})

y = [1,0,1,1]

# Column selectors from the question (not used by the Pipeline-based preprocessor below).
numerical_features=make_column_selector(dtype_include=np.number)
cat_features=make_column_selector(dtype_exclude=np.number)

numerical_pipeline= make_pipeline(KNNImputer())
# Dense output is required because KNNImputer does not accept sparse input.
# On scikit-learn older than 1.2, pass sparse=False instead of sparse_output=False.
cat_pipeline=make_pipeline(SimpleImputer(strategy='most_frequent'),
                            OneHotEncoder(handle_unknown='ignore', sparse_output=False))
# cat_pipeline runs first so that KNNImputer never sees string data.
preprocessor = Pipeline([('cat_pipeline', cat_pipeline),
                        ('numerical_pipeline',numerical_pipeline)])
model=make_pipeline(preprocessor, LinearRegression() )

params={
    'pipeline__numerical_pipeline__knnimputer__n_neighbors':[1,2]
}


grid=GridSearchCV(model, param_grid=params,scoring = 'r2',cv=2)

grid.fit(X, y)
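
The parameter name in the question starts with columntransformer__, so the preprocessor there may have been meant as a ColumnTransformer and not a Pipeline. Under that assumption, the following sketch routes numeric columns to KNNImputer and the remaining columns to the categorical pipeline, using explicit transformer names so that the question's original parameter name becomes valid. The toy data, including the extra price column, is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration.
X = pd.DataFrame({
    'city': ['London', 'London', 'Paris', np.nan, 'Paris', 'London'],
    'rating': [5.0, 3.0, np.nan, 5.0, 4.0, 2.0],
    'price': [10.0, 12.0, 9.0, 11.0, 8.0, 13.0],
})
y = [1.0, 0.0, 1.0, 1.0, 0.5, 0.0]

numerical_pipeline = make_pipeline(KNNImputer())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore'))

# Each transformer only sees the columns picked by its selector,
# so KNNImputer never receives the string column.
preprocessor = ColumnTransformer([
    ('numerical_pipeline', numerical_pipeline,
     make_column_selector(dtype_include=np.number)),
    ('cat_pipeline', cat_pipeline,
     make_column_selector(dtype_exclude=np.number)),
])
model = make_pipeline(preprocessor, LinearRegression())

params = {
    'columntransformer__numerical_pipeline__knnimputer__n_neighbors': [1, 2]
}
grid = GridSearchCV(model, param_grid=params, scoring='r2', cv=2)
grid.fit(X, y)
print(grid.best_params_)
```

With this layout the order of the two transformers no longer matters, because they run side by side on disjoint columns instead of one after the other.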