scikit-learn GridSearchCV does not work properly with random forest

Date: 2018-05-04 17:16:39

Tags: machine-learning scikit-learn random-forest grid-search

I have a grid search implementation for random forest models.


Before using it in this grid search, I had used the exact same dataset for many other tasks, so there should not be any problem with the data. Also, as a sanity check, I first ran the whole pipeline with LinearRegression, and it worked fine. I then switched to RandomForestRegressor with a very small number of estimators to test it again. A very strange thing happened, and I am attaching the detailed output below. Performance dropped dramatically, and I have no idea what is going on. There is no reason a grid search this small should take 30+ minutes to run.
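The LinearRegression sanity check described above can be reproduced with a minimal, self-contained sketch. The synthetic data from make_regression and the fit_intercept grid are assumptions for illustration, standing in for the real dataset and pipeline:

```python
# Minimal smoke test: GridSearchCV with LinearRegression on synthetic data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.10, random_state=0)

# LinearRegression has few hyper-parameters; fit_intercept is enough here.
grid = GridSearchCV(LinearRegression(), [{'fit_intercept': [True, False]}],
                    cv=5, scoring='neg_mean_squared_error')
grid.fit(train_X, train_y)
print(grid.best_params_)
```

If this finishes in well under a second, the pipeline itself is fine and any slowdown comes from the estimator that replaces LinearRegression.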

import time

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

train_X, test_X, train_y, test_y = train_test_split(features, target, test_size=.10, random_state=0)
# A bit of performance gain can be obtained from standardization
train_X, test_X = standarize(train_X, test_X)

tuned_parameters = [{
    'n_estimators': [5],
    'criterion': ['mse', 'mae'],
    'random_state': [0]
}]

scores = ['neg_mean_squared_error', 'neg_mean_absolute_error']
for n_fold in [5]:
    for score in scores:
        print("# Tuning hyper-parameters for %s with %d-fold" % (score, n_fold))
        start_time = time.time()
        print()

        # TODO: RandomForestRegressor
        clf = GridSearchCV(RandomForestRegressor(verbose=2), tuned_parameters, cv=n_fold,
                           scoring=score, verbose=2, n_jobs=-1)
        clf.fit(train_X, train_y)
        ... Rest omitted
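The `standarize` helper is not shown in the question; a plausible implementation (an assumption here, using scikit-learn's StandardScaler and fitting it on the training split only) would be:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def standarize(train_X, test_X):
    # Fit the scaler on the training split only, so no test-set
    # statistics leak into the model; then transform both splits.
    scaler = StandardScaler().fit(train_X)
    return scaler.transform(train_X), scaler.transform(test_X)
```

Fitting on the training data alone is the standard way to keep the test split untouched.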

The first part of the log prints within a few seconds, and then things seem to get stuck from here...

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s
building tree 2 of 5
building tree 3 of 5
building tree 3 of 5
building tree 3 of 5
building tree 3 of 5
building tree 4 of 5
building tree 4 of 5
building tree 4 of 5
building tree 4 of 5
building tree 5 of 5
building tree 5 of 5
building tree 5 of 5
building tree 5 of 5
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    5.0s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    5.0s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    5.0s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    5.0s finished
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.8s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total=   5.3s
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.8s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total=   5.3s
building tree 1 of 5
[CV] criterion=mae, n_estimators=5, random_state=0 ...................
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.9s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total=   5.3s
building tree 1 of 5
[CV] criterion=mae, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.9s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total=   5.3s
[CV] criterion=mae, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
building tree 2 of 5
building tree 3 of 5
building tree 4 of 5
building tree 5 of 5
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    5.3s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.5s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total=   5.6s
[CV] criterion=mae, n_estimators=5, random_state=0 ...................
building tree 1 of 5

These lines take more than 20 minutes.

By the way, linear regression takes less than 1 second for each GridSearchCV run.

Do you know why the performance drops so much?

Any suggestions and comments are appreciated. Thanks.

1 Answer:

Answer 0 (score: 1)

Try setting max_depth for RandomForestRegressor. This should reduce fitting time. By default max_depth=None, so each tree is grown until its leaves are pure.

For example:

tuned_parameters = [{
    'n_estimators': [5],
    'criterion': ['mse', 'mae'],
    'random_state': [0],
    'max_depth': [4],
}]
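The effect of capping max_depth can be checked with a quick sketch on synthetic data (an assumption standing in for the asker's dataset): the unconstrained forest grows much deeper trees than the capped one.

```python
# Compare tree depths with and without a max_depth cap.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=0)

deep = RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y)
shallow = RandomForestRegressor(n_estimators=5, max_depth=4, random_state=0).fit(X, y)

print(max(t.tree_.max_depth for t in deep.estimators_))     # grown until leaves are pure
print(max(t.tree_.max_depth for t in shallow.estimators_))  # capped at 4
```

Shallower trees mean far fewer splits to evaluate, which is where the fitting time goes.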

Edit: Also, RandomForestRegressor defaults to n_jobs=1. With this setting it builds one tree at a time. Try setting n_jobs=-1.

Also, instead of looping over the scoring parameter, you can pass multiple metrics to GridSearchCV at once. When you do this, you must also specify the metric you want GridSearchCV to select on as the value of refit. Then you can access all the scores in the cv_results_ dictionary after fitting.

    import numpy as np

    clf = GridSearchCV(RandomForestRegressor(verbose=2), tuned_parameters,
                       cv=n_fold, scoring=scores, refit='neg_mean_squared_error',
                       verbose=2, n_jobs=-1)

    clf.fit(train_X, train_y)
    results = clf.cv_results_
    print(np.mean(results['mean_test_neg_mean_squared_error']))
    print(np.mean(results['mean_test_neg_mean_absolute_error']))

http://scikit-learn.org/stable/auto_examples/model_selection/plot_multi_metric_evaluation.html#sphx-glr-auto-examples-model-selection-plot-multi-metric-evaluation-py