Question

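I am trying to plot a learning curve for my logit model, but I get the error below, even though I sized the training-set sizes from the shape of the input with array = np.linspace(0, dataframe.shape[0]). Could some data be getting lost somewhere? There is a gap of more than 25k rows between the expected maximum and my input data, but I cannot see where it comes from.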

def get_learning_curves(dataframe, model, X, y):
#check for overfitting
    
    array = np.linspace(0, dataframe.shape[0])
    train_sizes = array.astype(int)
    # Get train scores (R2), train sizes, and validation scores using `learning_curve`
    train_sizes, train_scores, test_scores = learning_curve(
        estimator=model, X=X, y=y, train_sizes=train_sizes, cv=5)

    # Take the mean of cross-validated train scores and validation scores
    train_scores_mean = np.mean(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    plt.plot(train_sizes, train_scores_mean, label = 'Training score')
    plt.plot(train_sizes, test_scores_mean, label = 'Test score')
    plt.ylabel('r2 score', fontsize = 14)
    plt.xlabel('Training set size', fontsize = 14)
    plt.title('Learning curves', fontsize = 18, y = 1.03)
    plt.legend()
   
    return plt.show()

get_learning_curves(pre, LogisticRegression(), X_pre, y_pre)

pre.shape
>>>(125578, 23)

I get the error:

ValueError: train_sizes has been interpreted as absolute numbers of training samples and 
must be within (0, 100462], but is within [0, 125578].

Answer 1

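The error message you are getting is fairly self-explanatory. It says:

"The absolute number of training samples must be at least 1 and cannot exceed 100462."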

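That is because learning_curve uses cross-validation. Cross-validation holds out 1 of the k folds for testing the model. If n is the absolute number of samples, roughly n/k samples are held out for testing, which in turn means you can specify at most n - n/k samples as the subset size for training. That is why the bound in your case is 125578 - 125578/5 = 100462 (rounded down). No data is lost: the missing ~25k rows are the held-out test fold.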
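The arithmetic above can be checked with a small sketch. The helper name max_train_size is hypothetical, and the fold-size rule (the first n % k folds get one extra sample) is an assumption modeled on plain K-fold splitting; stratified splitting, which scikit-learn uses by default for classifiers, may distribute samples slightly differently.

```python
import numpy as np

def max_train_size(n_samples: int, n_splits: int) -> int:
    """Training-set size of the first split under plain K-fold (hypothetical helper)."""
    # Every fold gets n_samples // n_splits samples ...
    fold_sizes = np.full(n_splits, n_samples // n_splits)
    # ... and the first (n_samples % n_splits) folds get one extra sample.
    fold_sizes[: n_samples % n_splits] += 1
    # The first fold is held out for testing; the rest is available for training.
    return int(n_samples - fold_sizes[0])

print(max_train_size(125578, 5))  # 100462
```

With 125578 rows and cv=5, the first test fold has 25116 rows, leaving 100462 for training, which matches the upper bound in the error message.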

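To solve your problem, you have to pick the sample sizes from a valid interval. If you want to keep using absolute numbers for the sizes, one way to do this is to change: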

array = np.linspace(0, dataframe.shape[0])

to

array = np.linspace(1, int(dataframe.shape[0]*0.8))

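This keeps the sizes within the bounds imposed by 5-fold cross-validation: the smallest size is 1 instead of 0, and the largest is 80% of the rows.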
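A slightly more general version of the change above might look like the following sketch. The helper absolute_train_sizes and its parameters are hypothetical, and the upper bound assumes plain K-fold arithmetic. The smallest parameter is there because a classifier such as LogisticRegression may fail to fit on a very small subset that happens to contain a single class.

```python
import numpy as np

def absolute_train_sizes(n_samples, n_splits=5, num=10, smallest=1):
    """Evenly spaced absolute training sizes that fit inside K-fold CV (hypothetical helper)."""
    # Largest usable training size: all samples minus one held-out fold.
    upper = n_samples * (n_splits - 1) // n_splits
    sizes = np.linspace(smallest, upper, num=num).astype(int)
    # Drop duplicates that integer truncation can create on small datasets.
    return np.unique(sizes)

train_sizes = absolute_train_sizes(125578, n_splits=5, num=10, smallest=1000)
print(train_sizes[0], train_sizes[-1])  # 1000 100462

# The result can then be passed on, e.g.:
# learning_curve(estimator=model, X=X, y=y, train_sizes=train_sizes, cv=5)
```

Alternatively, learning_curve also accepts floats in (0, 1] for train_sizes, for example np.linspace(0.1, 1.0, 10), which it interprets as fractions of the maximum training-set size, so the bound never has to be computed by hand.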
