How to improve model accuracy using CatBoost

Asked: 2020-03-12 06:15:55

Tags: machine-learning classification catboost

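I am trying to build a binary classification model for an employee salary dataset using CatBoost. I have tuned it as far as I can, but I still only get 87% accuracy. How can I raise this to ~98% or higher?

The goal is to predict class.

Here are the dataset and the code:

Dataset: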

http://archive.ics.uci.edu/ml/datasets/Adult

Code:

    from catboost import CatBoostClassifier

    import pandas as pd
    import numpy as np
    from numpy import arange
    from tqdm import tqdm_notebook as tqdm
    import matplotlib.pyplot as plt
    plt.style.use('ggplot')
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn import metrics, preprocessing

    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')

    X = train.drop('class', axis=1)
    y = train['class']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=27)
    test_data = test.drop('class', axis=1)
    print(y_train.value_counts())
    print(y_test.value_counts())


    #provide categorical features to catboost
    cat_features = ['workclass','education','marital-status','occupation','relationship','race','sex','native-country']

    best_params = {
            'bagging_temperature': 0.5,
            'depth': 8,
            'iterations': 1000,
            'learning_rate': 0.05,
            'sampling_frequency': 'PerTreeLevel',
            'leaf_estimation_method': 'Gradient',
            'random_strength': 0.8,
            'boosting_type': 'Ordered',
            'feature_border_type': 'MaxLogSum',
            # 'l2_leaf_reg' was listed twice (25, then 50); a dict literal keeps
            # only the last value, so 50 is what the model actually used
            'l2_leaf_reg': 50,
            'max_ctr_complexity': 2,
            'fold_len_multiplier': 2
    }

    model_cat = CatBoostClassifier(**best_params,
                               loss_function='Logloss',
                               eval_metric='AUC',
                               nan_mode='Min',
                               thread_count=8,
                               task_type='CPU',
                               verbose=True)


    model_cat.fit(X_train, y_train,
                              eval_set=(X_test, y_test),
                              cat_features=cat_features,
                              verbose_eval=300,
                              early_stopping_rounds=500,
                              use_best_model=True,
                              plot=False)


    model_cat.save_model("catmodel")

    ##Predictions
    cat_predictions = model_cat.predict_proba(test_data)[:, 1]
    cat_predictions_df = pd.DataFrame({'class': cat_predictions})

This is the best accuracy I got after all the tuning.

     Test set class grouping:
     <=50K    7451
     >50K     2318

      Predicted
       Y    N
    [[7037  799]
     [ 414 1519]]

    Precision:  0.9444369883237149
    Recall:  0.8980347115875447
    Accuracy:  0.8758317125601393
    F1-score:  0.9206515339831229

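So I still have 414 FP and 799 FN, and neither number is good. I have tried all of the parameters in best_params from the documentation, with different values for each.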
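
For reference, the reported metrics can be reproduced from the four counts in the matrix. This is only a sketch of the arithmetic: it assumes "<=50K" is the positive class and takes 414 as FP and 799 as FN, as stated above. The majority-class baseline is an added comparison computed from the test set class counts, not something the original code printed.

```python
# Counts from the confusion matrix in the question.
# Assumption: "<=50K" is the positive class, with FP=414 and FN=799 as stated.
tp, fn = 7037, 799
fp, tn = 414, 1519

total = tp + fn + fp + tn                      # 9769 rows in the held-out set
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the majority class "<=50K"
# (7451 of the 9769 held-out rows, from the class grouping above).
majority_baseline = 7451 / (7451 + 2318)

print(f"accuracy          = {accuracy:.4f}")
print(f"precision         = {precision:.4f}")
print(f"recall            = {recall:.4f}")
print(f"f1                = {f1:.4f}")
print(f"majority baseline = {majority_baseline:.4f}")
```

Read this way, the 87% figure sits about 11 points above the accuracy of always predicting "<=50K", which is roughly 76% on this split.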

2 answers:

Answer 0 (score: 1)

You could tune the model further with the grid_search method that catboost.CatBoostClassifier provides.

For reference, see: https://catboost.ai/docs/concepts/python-reference_catboost_grid_search.html

Answer 1 (score: 1)

@MJ209, here are the grid search parameters and the resulting accuracy.

    params = {'depth':[3,1,2,6,4,5,7,8,9,10],
              'iterations':[250,100,500,1000],
              'learning_rate':[0.03,0.001,0.01,0.1,0.2,0.3],
              'l2_leaf_reg':[3,1,5,10,100],
              'border_count':[32,5,10,20,50,100,200],
              'bagging_temperature':[0.03,0.09,0.25,0.75],
              'random_strength':[0.2,0.5,0.8],
              'max_ctr_complexity':[1,2,3,4,5] }


    model = CatBoostClassifier()
    grid_search_result = model.grid_search(params,
                                           X=train_set,
                                           y=train_label,
                                           cv=5,
                                           partition_random_seed=3,
                                           stratified=True)

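After it had run for 36 hours I stopped the program partway through, because it reported that it would need another 24 days to finish. Below is the last trace line from the program: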

    150625: loss: 0.2800770 best: 0.2746405 (5345) total: 7m 22s remaining: 24d 1h 2m 7s
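
The long runtime follows from the size of the grid, which can be counted without training anything. The sketch below reuses the `params` dictionary from this answer. `sample_configs` is a hypothetical helper, not part of CatBoost, included only to illustrate drawing a fixed budget of random configurations instead of walking the full grid.

```python
import math
import random

# Same grid as in the answer above.
params = {'depth': [3, 1, 2, 6, 4, 5, 7, 8, 9, 10],
          'iterations': [250, 100, 500, 1000],
          'learning_rate': [0.03, 0.001, 0.01, 0.1, 0.2, 0.3],
          'l2_leaf_reg': [3, 1, 5, 10, 100],
          'border_count': [32, 5, 10, 20, 50, 100, 200],
          'bagging_temperature': [0.03, 0.09, 0.25, 0.75],
          'random_strength': [0.2, 0.5, 0.8],
          'max_ctr_complexity': [1, 2, 3, 4, 5]}

# A full grid search visits every combination, so the number of candidate
# configurations is the product of the list lengths.
n_combinations = math.prod(len(values) for values in params.values())
print(n_combinations)  # 504000


def sample_configs(grid, n, seed=0):
    """Hypothetical helper: draw n random configurations from the grid."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in grid.items()}
            for _ in range(n)]


# Example: a budget of 20 candidate configurations instead of 504000.
for config in sample_configs(params, 20)[:3]:
    print(config)
```

Every one of the 504,000 combinations needs at least one training run, which is why the estimate is measured in weeks. CatBoost's Python API also documents a `randomized_search` method with an `n_iter` budget, which may be a more practical starting point than the full grid.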
