我试图了解如何在逻辑回归交叉验证中计算最佳系数,其中“refit”参数为True。 如果我正确理解docs,则最佳系数是首先确定最佳正则化参数“C”的结果,即,在所有折叠上具有最高平均分数的C的值。然后,最佳系数只是在最佳C得分最高的折叠上计算的系数。我假设如果最大得分达到几倍,则这些折叠的系数将被平均以给出最佳系数(我在文档中没有看到如何处理这个案例)。
为了测试我的理解,我用两种不同的方式确定了最佳系数:
我从1.和2.得到的结果相似但不完全相同,所以我希望有人能指出我在这里做错了什么。 谢谢!
演示此问题的示例:
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Set parameters
n_folds = 10
C_values = [0.001, 0.01, 0.05, 0.1, 1., 100.]
# Load and preprocess data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_train)
# Fit model
clf = LogisticRegressionCV(Cs=C_values, cv=n_folds, penalty='l1',
refit=True, scoring='roc_auc',
solver='liblinear', random_state=0,
fit_intercept=False)
clf.fit(X_train_scaled, y_train)
########################
# Get and plot coefficients using method 1
########################
coefs1 = clf.coef_
coefs1_series = pd.Series(coefs1.ravel(), index=cancer['feature_names'])
coefs1_series.sort_values().plot(kind="barh")
########################
# Get and plot coefficients using method 2
########################
# mean of scores of class "1"
scores = clf.scores_[1]
mean_scores = np.mean(scores, axis=0)
# Get index of the C that has the highest average score across all folds
best_C_idx = np.where(mean_scores==np.max(mean_scores))[0][0]
# Get index (here: indices) of the folds with highest scores for the
# best C
best_folds_idx = np.where(scores[:, best_C_idx]==np.max(scores[:, best_C_idx]))[0]
paths = clf.coefs_paths_[1] # has shape (n_folds, len(C_values), n_features)
coefs2 = np.squeeze(paths[best_folds_idx, best_C_idx, :])
coefs2 = np.mean(coefs2, axis=0)
coefs2_series = pd.Series(coefs2.ravel(), index=cancer['feature_names'])
coefs2_series.sort_values().plot(kind="barh")