Training set sizes for learning_curve

Time: 2019-11-01 07:43:56

Tags: python keras neural-network

I am puzzled by the result of my call to learning_curve():

X_train1_be.shape
> (1360, 2)
y_train1_be.shape
> (1360, 2)

train_sizes, train_scores, test_scores = learning_curve(grid_best
                                                        , X_train1_be
                                                        , y_train1_be
                                                        , n_jobs=n_jobs
                                                        , scoring = 'neg_mean_squared_error'
                                                        , cv=TimeSeriesSplit(n_splits = 5)
                                                        , verbose=2
                                                        , shuffle = False
                                                        , train_sizes = [1
                                                                         , round(len(X_train1_be)/10)
                                                                         , round(len(X_train1_be)/5)
                                                                         , round(len(X_train1_be)/3)
                                                                         , round(len(X_train1_be)/2)
                                                                         , round(len(X_train1_be)/1)
                                                                        ]
                                                        )

But this results in:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-178-9216e6224b3b> in <module>
     12                                                                          , round(len(X_train1_be)/3)
     13                                                                          , round(len(X_train1_be)/2)
---> 14                                                                          , round(len(X_train1_be)/1)
     15                                                                         ]
     16                                                         )

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in learning_curve(estimator, X, y, groups, train_sizes, cv, scoring, exploit_incremental_learning, n_jobs, pre_dispatch, verbose, shuffle, random_state, error_score)
   1257     # use the first 'n_max_training_samples' samples.
   1258     train_sizes_abs = _translate_train_sizes(train_sizes,
-> 1259                                              n_max_training_samples)
   1260     n_unique_ticks = train_sizes_abs.shape[0]
   1261     if verbose > 0:

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _translate_train_sizes(train_sizes, n_max_training_samples)
   1341                              % (n_max_training_samples,
   1342                                 n_min_required_samples,
-> 1343                                 n_max_required_samples))
   1344 
   1345     train_sizes_abs = np.unique(train_sizes_abs)

ValueError: train_sizes has been interpreted as absolute numbers of training samples and must be within (0, 230], but is within [1, 1360].

By contrast, the following works:

grid_best = grid_result.best_estimator_
train_sizes, train_scores, test_scores = learning_curve(grid_best
                                                        , X_train1_be
                                                        , y_train1_be
                                                        , n_jobs=n_jobs
                                                        , scoring = 'neg_mean_squared_error'
                                                        , cv=TimeSeriesSplit(n_splits = 5)
                                                        , verbose=2
                                                        , shuffle = False
                                                        , train_sizes = np.linspace(0.001, 1, 10))

> [learning_curve] Training set sizes: [  1  25  51  76 102 127 153 178 204 230]

According to this link, the first version should work:

  

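Determining the training set sizes

First, we decide which training set sizes to use for generating the learning curves. The minimum value is 1. The maximum is given by the number of instances in the training set. Our training set has 9568 instances, so the maximum value is 9568. However, we have not yet set aside a validation set. We will do that using an 80:20 ratio, ending up with a training set of 7654 instances (80%) and a validation set of 1914 instances (20%). Given that our training set will have 7654 instances, the maximum value we can use to generate our learning curves is 7654. For our case, here, we use these six sizes: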

     

train_sizes = [1,100,500,2000,5000,7654]

1 answer:

Answer 0 (score: 0)

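This seems to have been raised as an issue some time ago: github.com/scikit-learn/scikit-learn/issues/7834. In other words, it is not currently possible, and that does not look likely to change soon.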
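The 230 in the error message is consistent with how TimeSeriesSplit sizes its folds: with 1360 samples and n_splits=5, the first training fold holds 230 rows, and learning_curve appears to take its upper bound from that first split. Below is a minimal sketch that reproduces both the bound and the tick sizes printed for np.linspace(0.001, 1, 10). It uses a zero-filled placeholder array in place of X_train1_be, and the truncate-and-clip step is my reading of the printed output, not a quotation of the library source.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder with the same shape as X_train1_be (the values do not matter here).
X = np.zeros((1360, 2))

# Training-fold lengths produced by the splitter used in the question.
cv = TimeSeriesSplit(n_splits=5)
fold_train_sizes = [len(train_idx) for train_idx, _ in cv.split(X)]
print(fold_train_sizes)  # [230, 456, 682, 908, 1134]

# Assumption: fractional train_sizes are scaled by the first fold's length,
# truncated to integers, and clipped to at least 1.
n_max = fold_train_sizes[0]
fractions = np.linspace(0.001, 1, 10)
ticks = np.clip((fractions * n_max).astype(int), 1, n_max)
print(ticks)
```

Under that reading, any absolute size above 230 is rejected for this splitter, even though the later folds are larger.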

For me, one workaround was to replicate the dataset so that the first fold contains the whole dataset.
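A different way around the same limit (not the replication trick described above) is to pass learning_curve one explicit chronological split instead of TimeSeriesSplit, so that the first and only training fold is as large as you need. The sketch below assumes synthetic data and a LinearRegression placeholder standing in for X_train1_be, y_train1_be and grid_best. The 226-row hold-out and the smallest size of 10 are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-ins for X_train1_be / y_train1_be (same shapes as in the question).
rng = np.random.default_rng(0)
X = rng.normal(size=(1360, 2))
y = X @ np.array([[1.0, 0.5], [-0.5, 2.0]]) + 0.1 * rng.normal(size=(1360, 2))

# One chronological split: train on the earliest rows, test on the last 226.
n_test = 226
train_idx = np.arange(0, len(X) - n_test)
test_idx = np.arange(len(X) - n_test, len(X))

n_train = len(train_idx)  # 1134
sizes = [10, n_train // 10, n_train // 5, n_train // 3, n_train // 2, n_train]

train_sizes, train_scores, test_scores = learning_curve(
    LinearRegression(),          # placeholder for grid_best
    X, y,
    cv=[(train_idx, test_idx)],  # iterable of (train, test) index arrays
    scoring="neg_mean_squared_error",
    shuffle=False,               # keep the first n rows, in time order
    train_sizes=sizes,
)
print(train_sizes)
```

The trade-off is that there is only one test fold, so the scores have a single column and give no fold-to-fold spread.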