While learning to use Pipelines and GridSearchCV, I tried ensembling a random forest regressor with a support vector regressor. With GridSearchCV alone, each model scored around 90%, and I was stuck there. But putting the SVR before the random forest in a pipeline pushed the score to 92%.
I couldn't find any examples of this, so I assume it is either not very useful, incorrect, or that there is a better way. Any guidance would be appreciated.
I put together a quick example using SKLearn's Boston housing data with Lasso and a random forest. Combining them raised the "mean_test_score" from roughly 62% to 65%. The relevant snippet is below; the full notebook is at: http://nbviewer.jupyter.org/gist/Blebg/ce279345456dc706d2deddcfab49a984
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
import pandas as pd

class Lasso_t(Lasso):  # give Lasso a transform method so it can sit mid-pipeline
    def transform(self, x):
        return super(Lasso_t, self).predict(x).reshape(-1, 1)

# The pipe appends a Lasso prediction column that Random Forest receives as an extra feature
pipe = Pipeline(steps=[
    ('std_scaler', StandardScaler()),
    ('union', FeatureUnion([('reg', Lasso_t(alpha=0.2)),
                            ('keep_X', FunctionTransformer(lambda x: x))])),
    ('rf', RandomForestRegressor(n_estimators=100))])

params = dict(
    rf__min_samples_leaf=[1, 5, 10],
    rf__max_features=['log2', 'sqrt'])

grid_search = GridSearchCV(pipe, param_grid=params, cv=5)
grid_search.fit(X, y)
pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score').head(3)
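(Editorial note on the "better way" question: this hand-rolled pattern of feeding one model's predictions to another is exactly what sklearn.ensemble.StackingRegressor formalizes, available in scikit-learn 0.22 and later. A minimal sketch with illustrative parameters, using synthetic data rather than the Boston set:)

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Boston data
X, y = make_regression(n_samples=300, n_features=20, noise=10, random_state=0)

# StackingRegressor fits the base estimators, then trains a final estimator
# on their cross-validated predictions; passthrough=True also forwards the
# original features, mirroring the FeatureUnion 'keep_X' branch above.
stack = make_pipeline(
    StandardScaler(),
    StackingRegressor(
        estimators=[('lasso', Lasso(alpha=0.2)),
                    ('rf', RandomForestRegressor(n_estimators=50, random_state=0))],
        passthrough=True))

scores = cross_val_score(stack, X, y, cv=5)
print(scores.mean())
```

The default final estimator is RidgeCV; any regressor can be substituted via the final_estimator argument.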
Answer 0 (score: 0)
You may be looking for sklearn.ensemble.VotingRegressor, which combines two regression models by averaging their predictions.
Here is an example to get you started:
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
# Make fake data
X, y = make_regression(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipe = Pipeline([('scl', StandardScaler()),
                 ('vr', VotingRegressor([('svr', SVR()),
                                         ('rfr', RandomForestRegressor())]))])
search_space = [{'vr__rfr__min_samples_leaf': [1, 5, 10]}]
gs_cv = GridSearchCV(estimator=pipe,
                     param_grid=search_space,
                     n_jobs=-1)
gs_cv.fit(X_train, y_train)
gs_cv.predict(X_test)
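The snippet ends with predict, but to see whether averaging actually helps, it is worth scoring each model on the same held-out split. A self-contained sketch (same fake data as above; default hyperparameters, no grid search):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

def scaled(model):
    # wrap each model in the same scaling step so the comparison is fair
    return Pipeline([('scl', StandardScaler()), ('m', model)])

candidates = {
    'svr': scaled(SVR()),
    'rfr': scaled(RandomForestRegressor(random_state=42)),
    'vote': scaled(VotingRegressor([('svr', SVR()),
                                    ('rfr', RandomForestRegressor(random_state=42))])),
}
test_scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
               for name, m in candidates.items()}
print(test_scores)  # R^2 for each model on the held-out split
```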