I'm using Python and scikit-learn to do some cross-validation testing. Currently I split a pandas DataFrame into a training set (X_train, y_train) and a test set (X_test, y_test), then perform a randomized 3-fold cross-validated grid search on the training set and use the best parameters from that search to fit a final model, which I evaluate on the test set:
import numpy as np
from sklearn import cross_validation, ensemble, grid_search

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X_data, y_data, train_size=0.5, random_state=1)

N_cv = y_train.shape[0]
kf = cross_validation.KFold(n=N_cv, n_folds=3, shuffle=True, random_state=None)
#Generate Gradient Boosting Regression
#Set up grid search parameters
gb_learning_grid = [2.0 ** i for i in range(-3, 2)]  # 0.125, 0.25, 0.5, 1.0, 2.0
gb_estimators_grid = [100, 200, 300]
gb_minleaf_grid = [25, 50, 75]
gradientboost_grid = ensemble.GradientBoostingRegressor()
gradientboost_param = {'learning_rate':gb_learning_grid, 'n_estimators':gb_estimators_grid, 'min_samples_leaf':gb_minleaf_grid}
#Stage 1 Grid Search
stage1_gb_model = grid_search.GridSearchCV(estimator=gradientboost_grid, param_grid=gradientboost_param, n_jobs=-1, cv=kf)
gradientboost_CV1 = stage1_gb_model.fit(X=X_train, y=y_train)
best_estimators_gb = gradientboost_CV1.best_params_['n_estimators']
best_learning_gb = gradientboost_CV1.best_params_['learning_rate']
best_minleaf_gb = gradientboost_CV1.best_params_['min_samples_leaf']
#Stage 2 Grid Search
gradientboost_grid = ensemble.GradientBoostingRegressor(
min_samples_leaf=best_minleaf_gb, n_estimators=best_estimators_gb)
stage2_learning_gb = np.arange(
    best_learning_gb - 0.025, best_learning_gb + 0.025, 0.00625)
stage2_learning_gb = [float(x) for x in stage2_learning_gb]
stage2_gb_param = {'learning_rate': stage2_learning_gb}
stage2_gb_model = grid_search.GridSearchCV(estimator=gradientboost_grid, param_grid=stage2_gb_param, n_jobs=-1, cv=kf)
gradientboost_CV2 = stage2_gb_model.fit(X=X_train, y=y_train)
best_learning_gb = gradientboost_CV2.best_params_['learning_rate']
#Generate Primary Model
final_gbr = ensemble.GradientBoostingRegressor(n_estimators=best_estimators_gb, learning_rate=best_learning_gb, min_samples_leaf=best_minleaf_gb)
final_fit = final_gbr.fit(X_train, y_train)
final_predict = final_fit.predict(X_test)
So being able to perform these randomized k-fold grid searches is cool, but is there a native way within the sklearn library to grid search against a specific, fixed set of data? To be more precise with my code above: is there a native way within sklearn to fit models on X_train, y_train over the given parameter grid, where the best parameters from the grid search are determined by each model's fit on one specific, held-out data set rather than on randomized k-folds of X_train, y_train?
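In other words, what I'd like is behaviour like the following sketch, where the grid search scores every candidate on one fixed validation fold instead of k random folds. Here I've expressed it with PredefinedSplit from the newer model_selection module (which replaced cross_validation and grid_search), since I assume that's the closest native equivalent; the toy regression data and the choice of the last 30 rows as the validation fold are just for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split

# Toy data standing in for X_data / y_data.
X_data, y_data = make_regression(n_samples=200, n_features=5, noise=10.0,
                                 random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, train_size=0.5, random_state=1)

# -1 marks rows that are always in the training portion; 0 marks the
# single fixed validation fold that every candidate is scored on.
test_fold = np.full(y_train.shape[0], -1, dtype=int)
test_fold[-30:] = 0
ps = PredefinedSplit(test_fold)

search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=1),
    param_grid={'learning_rate': [0.05, 0.1], 'n_estimators': [50, 100]},
    cv=ps)  # exactly one train/validation split, no shuffling
search.fit(X_train, y_train)
```

Every candidate is trained on the -1 rows and scored on the 0 rows, so best_params_ reflects the fit on that one fixed set, and the default refit then retrains the best candidate on all of X_train.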