I'm using GridSearchCV for ridge regression, but I can't get matplotlib to plot model performance against the regularization parameter (alpha).
Can anyone help?
My code:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_california_housing
cal=fetch_california_housing()
X = cal.data
y = cal.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
param_grid = {'alpha': np.logspace(-3, 3, 13)}
print(param_grid)
grid = GridSearchCV(Ridge(normalize=True), param_grid, cv=10)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
import matplotlib.pyplot as plt
alphas = np.logspace(-3, 3, 13)
plt.semilogx(alphas, grid.fit(X_train, y_train), label='Train')
plt.semilogx(alphas, grid.fit(X_test, y_test), label='Test')
plt.legend(loc='lower left')
plt.ylim([0, 1.0])
plt.xlabel('alpha')
plt.ylabel('performance')
# the error I got was: "ValueError: x and y must have same first dimension"
Basically, I would like to see something like the following:
Answer 0 (score: 2)
You should plot the scores, not the result of grid.fit().
First, fit with return_train_score=True:
grid = GridSearchCV(Ridge(normalize=True), param_grid, cv=10, return_train_score=True)
Then, after fitting the model, plot as follows:
plt.semilogx(alphas, grid.cv_results_['mean_train_score'], label='Train')
plt.semilogx(alphas, grid.cv_results_['mean_test_score'], label='Test')
plt.legend()
Result:
Answer 1 (score: 1)
When plotting model-selection performance from a grid search, it is common to plot the mean and standard deviation of the train and test scores over the cross-validation folds.
Care should also be taken over which scoring criterion the grid search uses to select the best model. For regression this is usually R-squared.
The grid search returns a dictionary (accessible via .cv_results_) that contains the per-fold train/test scores, as well as the time it took to fit and score each fold. A summary of that data using the mean and standard deviation is also included.
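As a sketch of what that dictionary holds (using a small synthetic regression problem instead of the housing data, and a plain Ridge() without the deprecated normalize option):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem as a stand-in for the housing data.
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X @ np.array([1.5, -2.0, 1.0]) + rng.randn(100) * 0.1

param_grid = {'alpha': np.logspace(-3, 3, 7)}
grid = GridSearchCV(Ridge(), param_grid, cv=5, return_train_score=True)
grid.fit(X, y)

# cv_results_ is a plain dict: per-fold scores, their mean/std summaries,
# and fit/score timings. Print just the summary keys.
for key in sorted(grid.cv_results_):
    if key.startswith('mean_') or key.startswith('std_'):
        print(key)
```

Each of these summary entries is an array with one value per alpha in the grid, which is exactly what gets plotted below.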
P.S. In newer versions of scikit-learn you need to pass return_train_score=True explicitly.
P.P.S. When using grid search for model selection you do not need to split the data into train/test sets yourself: the grid search splits the data automatically (cv=10 means the data is split into 10 folds).
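To illustrate the point, a minimal sketch on synthetic data (alpha grid and fold count chosen arbitrarily): the whole dataset goes into fit(), and every score in cv_results_ is still an out-of-sample estimate because each of the 10 folds is held out in turn.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.randn(200) * 0.1

param_grid = {'alpha': np.logspace(-3, 3, 7)}

# No train_test_split needed: with cv=10, GridSearchCV holds out each
# of the 10 folds in turn and scores the model on the held-out fold.
grid = GridSearchCV(Ridge(), param_grid, cv=10)
grid.fit(X, y)

print(grid.best_params_)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
```

best_score_ here is the mean held-out R-squared of the best alpha, not a training-set score.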
Given the above, I modified the code to:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_california_housing
cal = fetch_california_housing()
X = cal.data
y = cal.target
param_grid = {'alpha': np.logspace(-3, 3, 13)}
print(param_grid)
# note: Ridge's normalize option was removed in scikit-learn 1.2;
# on newer versions, scale the features with StandardScaler instead
grid = GridSearchCV(Ridge(normalize=True), param_grid,
                    cv=10, return_train_score=True, scoring='r2')
grid.fit(X, y)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
alphas = np.logspace(-3, 3, 13)
train_scores_mean = grid.cv_results_["mean_train_score"]
train_scores_std = grid.cv_results_["std_train_score"]
test_scores_mean = grid.cv_results_["mean_test_score"]
test_scores_std = grid.cv_results_["std_test_score"]
plt.figure()
plt.title('Model')
plt.xlabel('$\\alpha$ (alpha)')
plt.ylabel('Score')
# plot train scores
plt.semilogx(alphas, train_scores_mean, label='Mean Train score',
             color='navy')
# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alphas,
                       train_scores_mean - train_scores_std,
                       train_scores_mean + train_scores_std,
                       alpha=0.2, color='navy')
plt.semilogx(alphas, test_scores_mean,
             label='Mean Test score', color='darkorange')
# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alphas,
                       test_scores_mean - test_scores_std,
                       test_scores_mean + test_scores_std,
                       alpha=0.2, color='darkorange')
plt.legend(loc='best')
plt.show()