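I'm trying to build a binary classification model with CatBoost on an employee salary dataset. I've tuned it as far as I can, but I still only get 87% accuracy. How can I raise that to ~98% or higher?
The goal is to predict the class.
Here are the dataset and the code:
Dataset:
http://archive.ics.uci.edu/ml/datasets/Adult
Code: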
from catboost import CatBoostClassifier
import pandas as pd
import numpy as np
from numpy import arange
from tqdm import tqdm_notebook as tqdm
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import metrics, preprocessing
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X = train.drop('class', axis=1)
y = train['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=27)
test_data = test.drop('class', axis=1)
print(y_train.value_counts())
print(y_test.value_counts())
#provide categorical features to catboost
cat_features = ['workclass','education','marital-status','occupation','relationship','race','sex','native-country']
best_params = {
    'bagging_temperature': 0.5,
    'depth': 8,
    'iterations': 1000,
    'learning_rate': 0.05,
    'sampling_frequency': 'PerTreeLevel',
    'leaf_estimation_method': 'Gradient',
    'random_strength': 0.8,
    'boosting_type': 'Ordered',
    'feature_border_type': 'MaxLogSum',
    # 'l2_leaf_reg' was listed twice (25, then 50); a dict literal keeps the last value, so 50 was in effect
    'l2_leaf_reg': 50,
    'max_ctr_complexity': 2,
    'fold_len_multiplier': 2
}
model_cat = CatBoostClassifier(**best_params,
                               loss_function='Logloss',
                               eval_metric='AUC',
                               nan_mode='Min',
                               thread_count=8,
                               task_type='CPU',
                               verbose=True)
model_cat.fit(X_train, y_train,
              eval_set=(X_test, y_test),
              cat_features=cat_features,
              verbose_eval=300,
              early_stopping_rounds=500,
              use_best_model=True,
              plot=False)
model_cat.save_model("catmodel")
##Predictions
cat_predictions = model_cat.predict_proba(test_data)[:, 1]
cat_predictions_df = pd.DataFrame({'class': cat_predictions})
This is the best accuracy I got after all of the tuning.
Test set class grouping:
<=50K 7451
>50K 2318
Predicted
Y N
[[7037 799]
[ 414 1519]]
Precision: 0.9444369883237149
Recall: 0.8980347115875447
Accuracy: 0.8758317125601393
F1-score: 0.9206515339831229
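So I still have 414 FP and 799 FN, which is not a good result. I have tried every parameter in best_params from the documentation, with different values for each.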
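As a quick sanity check, the reported metrics can be recomputed from the four matrix cells. This is only a sketch that takes the post's own reading of the matrix (414 false positives, 799 false negatives, with "<=50K" as the positive class); the variable names are made up for illustration.

```python
# Cell counts as read in the post (assumption: "<=50K" is the positive class).
tp, fn = 7037, 799
fp, tn = 414, 1519

total = tp + fn + fp + tn
errors = fp + fn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Largest number of misclassified rows that still gives at least 98% accuracy.
max_errors_for_98 = int(0.02 * total)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
print(f"errors now: {errors}, allowed at 98%: {max_errors_for_98}")
```

In other words, going from 87% to 98% on these 9,769 rows would mean cutting the misclassified rows from 1,213 to 195 or fewer.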
Answer 0: (score: 1)
I'd suggest tuning it further with the grid_search method provided by catboost.CatBoostClassifier.
For more details, see: https://catboost.ai/docs/concepts/python-reference_catboost_grid_search.html
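A minimal sketch of what such a call might look like, assuming a DataFrame `X`, labels `y`, and the `cat_features` list from the question. The function name `tune_with_grid_search` and the values in `SMALL_GRID` are illustrative choices, not recommended settings.

```python
# Hypothetical small grid; the values are placeholders, not tuned suggestions.
SMALL_GRID = {
    'depth': [4, 6, 8],
    'learning_rate': [0.03, 0.1],
    'l2_leaf_reg': [1, 3, 10],
}


def tune_with_grid_search(X, y, cat_features):
    """Run CatBoost's built-in grid search and return (model, best_params)."""
    # Imported here so this module still loads where catboost is not installed.
    from catboost import CatBoostClassifier, Pool

    # A Pool carries the labels and the categorical column names together.
    pool = Pool(data=X, label=y, cat_features=cat_features)
    model = CatBoostClassifier(loss_function='Logloss', verbose=False)
    result = model.grid_search(SMALL_GRID,
                               X=pool,
                               cv=3,
                               partition_random_seed=3,
                               stratified=True,
                               verbose=False)
    # The returned dict holds the best combination under the 'params' key.
    return model, result['params']
```

It could be called as `model, best = tune_with_grid_search(X_train, y_train, cat_features)`. Keeping the grid to a few values per parameter keeps the number of combinations manageable.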
Answer 1: (score: 1)
@MJ209, here are the grid search parameters and the resulting accuracy.
params = {'depth': [3, 1, 2, 6, 4, 5, 7, 8, 9, 10],
          'iterations': [250, 100, 500, 1000],
          'learning_rate': [0.03, 0.001, 0.01, 0.1, 0.2, 0.3],
          'l2_leaf_reg': [3, 1, 5, 10, 100],
          'border_count': [32, 5, 10, 20, 50, 100, 200],
          'bagging_temperature': [0.03, 0.09, 0.25, 0.75],
          'random_strength': [0.2, 0.5, 0.8],
          'max_ctr_complexity': [1, 2, 3, 4, 5]}
model = CatBoostClassifier()
grid_search_result = model.grid_search(params,
                                       X=train_set,
                                       y=train_label,
                                       cv=5,
                                       partition_random_seed=3,
                                       stratified=True)
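After 36 hours of running I stopped the program partway through, because it reported that it still needed another 24 days to finish. Here is the last line of the trace:
150625: loss: 0.2800770 best: 0.2746405 (5345) total: 7m 22s remaining: 24d 1h 2m 7s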
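For a sense of scale, the size of the grid above can be counted directly. A full grid search trains at least one model per combination, so the count is a lower bound on the number of training runs. This is plain arithmetic on the posted `params` dict and makes no assumptions about CatBoost internals.

```python
from math import prod

# The grid exactly as posted above.
params = {'depth': [3, 1, 2, 6, 4, 5, 7, 8, 9, 10],
          'iterations': [250, 100, 500, 1000],
          'learning_rate': [0.03, 0.001, 0.01, 0.1, 0.2, 0.3],
          'l2_leaf_reg': [3, 1, 5, 10, 100],
          'border_count': [32, 5, 10, 20, 50, 100, 200],
          'bagging_temperature': [0.03, 0.09, 0.25, 0.75],
          'random_strength': [0.2, 0.5, 0.8],
          'max_ctr_complexity': [1, 2, 3, 4, 5]}

# Number of distinct parameter combinations in the full grid.
n_combinations = prod(len(values) for values in params.values())
print(n_combinations)
```

That comes to 504,000 combinations, which is consistent with a multi-week estimate. Trimming each list to two or three values, or sampling a fixed number of combinations (CatBoost also has a randomized_search method with an n_iter argument), would shrink the search considerably.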