I am working on a regression problem and want to evaluate the effect of using different scaling methods (StandardScaler, RobustScaler, Normalizer, ...).
Later, I also want to evaluate different ways of handling missing data (SimpleImputer, IterativeImputer).
Here is my current setup:
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler, Normalizer

# Create some dummy data
X = pd.DataFrame({
    'x1': np.random.rand(1000)*123 - 83,
    'x2': np.random.rand(1000)*23 + 34
})
y = X['x1'] * X['x2'] + 5 * X['x2'] - 9012 + np.random.rand(1000) * 1000

# Set up three pipelines with different scalers
pipe1 = Pipeline([
    ('scale', StandardScaler()),
    ('svr', svm.SVR())
])
pipe2 = Pipeline([
    ('scale', RobustScaler()),
    ('svr', svm.SVR())
])
pipe3 = Pipeline([
    ('scale', Normalizer()),
    ('svr', svm.SVR())
])

# SVR parameters for each pipeline
param_grid = [
    {'svr__C': [1, 10, 100, 1000], 'svr__kernel': ['linear']},
    {'svr__C': [1, 10, 100, 1000], 'svr__gamma': [0.001, 0.0001], 'svr__kernel': ['rbf']},
]

# Apply GridSearchCV to each pipeline and report
grid_search = GridSearchCV(pipe1, param_grid, cv=5, n_jobs=-1).fit(X, y)
print('Best score ({:.2f}) was reached with params {}'.format(grid_search.best_score_, grid_search.best_params_))
grid_search = GridSearchCV(pipe2, param_grid, cv=5, n_jobs=-1).fit(X, y)
print('Best score ({:.2f}) was reached with params {}'.format(grid_search.best_score_, grid_search.best_params_))
grid_search = GridSearchCV(pipe3, param_grid, cv=5, n_jobs=-1).fit(X, y)
print('Best score ({:.2f}) was reached with params {}'.format(grid_search.best_score_, grid_search.best_params_))
What bothers me is that I have to define a separate pipeline for every scaler. So my question is: is there a way to include different transformers (e.g. StandardScaler, Normalizer, ...) in the grid search itself?
Ideally, I would like my code to look something like this:
pipe = Pipeline(
# ???
)
param_grid = [
{'normalization_method':[StandardScaler, RobustScaler, Normalizer], 'svr__C': [1, 10, 100, 1000], 'svr__kernel': ['linear']},
{'normalization_method':[StandardScaler, RobustScaler, Normalizer], 'svr__C': [1, 10, 100, 1000], 'svr__gamma': [0.001, 0.0001], 'svr__kernel': ['rbf']},
]
grid_search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1).fit(X, y)
print('Best score ({:.2f}) was reached with params {}'.format(grid_search.best_score_, grid_search.best_params_))
Answer (score: 1)
This may be a somewhat convoluted answer, but it works well. Here is the setup:
from sklearn.preprocessing import StandardScaler, RobustScaler, Normalizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
I will create a custom scaler that takes a scikit-learn scaler as a parameter:
class Scaler(BaseEstimator, TransformerMixin):
    def __init__(self, scaler=StandardScaler()):
        self.scaler = scaler

    def fit(self, x, y=None):
        # Fit the wrapped scaler, then return self so the wrapper
        # behaves like a regular scikit-learn transformer.
        self.scaler.fit(x)
        return self

    def transform(self, x):
        return self.scaler.transform(x)
As you can see, all it does is delegate to the methods of a regular scikit-learn scaler. The only difference is how it is initialized: by default, for convenience, I set scaler=StandardScaler(). You can then do the following:
scaler = Scaler(StandardScaler())
scaler.fit(X_train)
scaler.transform(X_train)[0:5]
>>> array([[-1.01827123, 1.2864604 , -1.39338902, -1.3621769 ],
[-0.7730102 , 2.43545215, -1.33550342, -1.49647603],
[-0.03722712, -0.78172474, 0.74837808, 0.92090833],
[ 0.20803391, 0.8268637 , 0.4010645 , 0.51801093],
[ 1.06644751, 0.13746866, 0.51683569, 0.3837118 ]])
This is equivalent to:
sl_scaler = StandardScaler()
sl_scaler.fit(X_train)
sl_scaler.transform(X_train)[0:5]
>>> array([[-1.01827123, 1.2864604 , -1.39338902, -1.3621769 ],
[-0.7730102 , 2.43545215, -1.33550342, -1.49647603],
[-0.03722712, -0.78172474, 0.74837808, 0.92090833],
[ 0.20803391, 0.8268637 , 0.4010645 , 0.51801093],
[ 1.06644751, 0.13746866, 0.51683569, 0.3837118 ]])
Now, the part that is interesting for you is that you can use it in a pipeline:
pipe = Pipeline([
    ('scaler', Scaler()),
    ('lr', LogisticRegression())
])
pipe.fit(X_train,y_train)
pipe.score(X_test,y_test)
>>> 0.9473684210526315
And, finally, in the grid search:
param_grid = [
    {'scaler__scaler': [StandardScaler(), RobustScaler(), Normalizer()],
     'lr__C': [1, 10, 100, 1000]},
]
grid_search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1).fit(X_train, y_train)
print(grid_search.score(X_test,y_test))
print(grid_search.best_params_)
>>> 0.9736842105263158
>>> {'lr__C': 1000, 'scaler__scaler': Normalizer(copy=True, norm='l2')}
This tells you that the best scaler for this example is Normalizer.
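If you also want to see how every scaler/C combination performed, rather than only the best one, one option is to inspect cv_results_ of the fitted grid_search from above. A quick sketch (the 'param_...' column names follow GridSearchCV's cv_results_ naming for this grid):
import pandas as pd

# cv_results_ holds one row per parameter combination tried
results = pd.DataFrame(grid_search.cv_results_)
print(results[['param_scaler__scaler', 'param_lr__C', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False))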
If you want to double-check the above, you can still run:
pipe_check = Pipeline([
    ('scaler', Normalizer()),
    ('lr', LogisticRegression())
])

grid_search_check = GridSearchCV(pipe_check, param_grid={'lr__C': [1, 10, 100, 1000]},
                                 cv=5, n_jobs=-1).fit(X_train, y_train)
print(grid_search_check.score(X_test, y_test))
print(grid_search_check.best_params_)
>>> 0.9736842105263158
>>> {'lr__C': 1000}
This confirms the result we obtained with the custom scaler!
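Applied back to your SVR setup, the same wrapper should slot in directly, so you only need a single pipeline and the scaler becomes just another hyperparameter. Here is a sketch (untested against your data, reusing the Scaler class defined above and the X, y from your question):
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler, Normalizer

# One pipeline; the scaler to use is searched over like any other parameter
pipe = Pipeline([
    ('scale', Scaler()),
    ('svr', svm.SVR())
])

param_grid = [
    {'scale__scaler': [StandardScaler(), RobustScaler(), Normalizer()],
     'svr__C': [1, 10, 100, 1000], 'svr__kernel': ['linear']},
    {'scale__scaler': [StandardScaler(), RobustScaler(), Normalizer()],
     'svr__C': [1, 10, 100, 1000], 'svr__gamma': [0.001, 0.0001],
     'svr__kernel': ['rbf']},
]

grid_search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1).fit(X, y)
print('Best score ({:.2f}) was reached with params {}'.format(
    grid_search.best_score_, grid_search.best_params_))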