How do I get the evaluation metric after CatBoostClassifier.fit()?

Time: 2018-05-25 17:17:42

Tags: catboost

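I trained a classification model by calling CatBoostClassifier.fit(), and I also provided an eval_set.

Now, how can I get the best value of the evaluation metric, and the iteration at which it was reached during training? I can plot this information by setting plot=True in the call to fit(), but how do I assign it to a variable?

I can do this when I train the model by calling cv(), because cv() returns the information I need. But according to the documentation, CatBoostClassifier.fit() does not return anything.

Here is the code snippet I use to fit the model: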

model = CatBoostClassifier(
                           random_seed=42,
                           logging_level='Silent',
                           eval_metric='Accuracy'
                          )

model.fit(X_train,
          y_train,
          cat_features=cat_features_idxs,
          eval_set=(X_val, y_val),
          plot=True
         )

If I use cv() instead, this is how I manage to get the information I need:

cv_data = cv(Pool(X, y, cat_features = cat_features_idxs),
             model.get_params(),
             fold_count = 5,
             plot=True)

print('Validation accuracy (best average among cross-validation folds) is {} obtained at step {}.'.format(np.max(cv_data['test-Accuracy-mean']), np.argmax(cv_data['test-Accuracy-mean'])))

1 answer:

Answer 0 (score: 1)

1) Just compute the score on the training data (and, in the same way, on the validation data):

https://stackoverflow.com/a/17954831

model = CatBoostClassifier(
                       random_seed=42,
                       logging_level='Silent',
                       eval_metric='Accuracy'
                      )

model.fit(X_train,
          y_train,
          cat_features=cat_features_idxs,
          eval_set=(X_val, y_val),
          plot=True
         )

train_score = model.score(X_train, y_train) # train (learn) score

val_score = model.score(X_val, y_val) # val (test) score
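
If you want the best value of the metric over all iterations, together with the iteration at which it occurred, one option is to scan the per-iteration history. Below is a minimal sketch, assuming the history is a nested dict of the form {dataset: {metric: [values...]}}. As far as I know, that is the shape model.get_evals_result() returns in recent CatBoost versions, although the name of the validation key may differ between versions. The best_metric helper and the sample numbers are made up for illustration:

```python
def best_metric(history, dataset, metric, higher_is_better=True):
    """Return (best_value, iteration) for one metric in a per-iteration history."""
    values = history[dataset][metric]
    pick = max if higher_is_better else min
    # On ties, max/min return the first matching index, i.e. the earliest iteration.
    best_iter = pick(range(len(values)), key=values.__getitem__)
    return values[best_iter], best_iter


# Made-up history, shaped like the assumed get_evals_result() output.
history = {
    "learn": {"Accuracy": [0.80, 0.85, 0.90, 0.93]},
    "validation": {"Accuracy": [0.78, 0.84, 0.83, 0.84]},
}

val_score, best_iter = best_metric(history, "validation", "Accuracy")
train_score = history["learn"]["Accuracy"][best_iter]  # train accuracy at the same iteration
```

For a loss-like metric such as Logloss, pass higher_is_better=False so that the smallest value is picked instead.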

Another way is to read the output files that CatBoost writes during training:

model = CatBoostClassifier(
                       random_seed=42,
                       logging_level='Silent',
                       eval_metric='Accuracy',
                       allow_writing_files=True
                      )

model.fit(X_train,
          y_train,
          cat_features=cat_features_idxs,
          eval_set=(X_val, y_val),
          plot=True
         )

import pandas as pd

# Per-iteration metrics are written to the catboost_info directory
test_error = pd.read_csv('catboost_info/test_error.tsv', sep='\t')
learn_error = pd.read_csv('catboost_info/learn_error.tsv', sep='\t')

best_row = test_error.loc[test_error['Accuracy'].idxmax()]  # first row with the highest validation accuracy
val_score = best_row['Accuracy']
best_iter = int(best_row['iter'])
train_score = learn_error.loc[learn_error['iter'] == best_iter, 'Accuracy'].values[0]
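
If pandas is not available, the same lookup can be done with the standard csv module. This is only a sketch: it assumes the file is tab-separated with an iter column and one column per metric, as the pandas snippet above implies. The best_row helper and the sample contents are made up so that the example runs without a trained model:

```python
import csv
import io


def best_row(tsv_text, metric, higher_is_better=True):
    """Return (iteration, value) of the best row for `metric` in a TSV metrics table."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    pick = max if higher_is_better else min
    best = pick(rows, key=lambda row: float(row[metric]))
    return int(best["iter"]), float(best[metric])


# Made-up contents mimicking the assumed layout of catboost_info/test_error.tsv.
sample = "iter\tAccuracy\tLogloss\n0\t0.78\t0.60\n1\t0.84\t0.45\n2\t0.83\t0.41\n"

best_iter, val_score = best_row(sample, "Accuracy")
loss_iter, best_loss = best_row(sample, "Logloss", higher_is_better=False)
```

With a real run you would pass the text of the file instead of sample, for example open('catboost_info/test_error.tsv').read().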

2) If pandas is installed, add as_pandas=True as an argument to cv(); cv_data can then be accessed as a DataFrame, for example cv_data['test-Accuracy-mean'].max()

https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_cv-docpage/

You can also read the output files as described above; in that case there will be a pair of folders for each fold.

Hope this helps!