Model evaluation error with cross-validation - average_precision_score

Time: 2021-03-15 20:28:01

Tags: python scikit-learn classification

I ran the following random forest grid search with balanced_accuracy as the scoring metric:

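from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV

# define the parameter grid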
param_grid = [
        {'criterion': ['gini', 'entropy'],      # try different purity metrics in building the trees
         'max_depth': [2, 5, 8, 10, 15, 20],    # vary the max_depth of the trees in the ensemble
         'n_estimators': [10, 50, 100, 200],    # vary the number of trees in the ensemble
         'max_samples': [0.4, 0.7, 0.9]}        # vary how many samples each tree is built with
]

# setup the Random Forest model with all arguments as default
model = RandomForestClassifier()

# pass the model and the param_grid to the grid search, and use 5 folds with 'balanced_accuracy' as the scoring measure
grid_search = GridSearchCV(model, param_grid, cv = 5, scoring = 'balanced_accuracy')

# fit the grid search to the training set
grid_search.fit(X_smote, y_smote)

# return best model
rf_best = grid_search.best_estimator_

# return the hyperparameter values of the best model
print(grid_search.best_params_)

# use the best model to make predictions on the test set
y_pred = rf_best.predict(X_test)

# compute the test set accuracy of the best model
print("accuracy: ", accuracy_score(y_test,y_pred))
print("f1: ", f1_score(y_test, y_pred, pos_label='Listed'))
print("precision: ", precision_score(y_test, y_pred, pos_label='Listed'))
print("recall: ", recall_score(y_test, y_pred, pos_label='Listed'))

This produced the following scores:


{'criterion': 'gini', 'max_depth': 20, 'max_samples': 0.7, 'n_estimators': 100}
accuracy:  0.6547231270358306
f1:  0.7612612612612613
precision:  0.9260273972602739
recall:  0.6462715105162524

I would like to use the average_precision scoring parameter instead, since it suits my use case better, so I updated the code as follows:

from sklearn.metrics import average_precision_score
# define the parameter grid
param_grid = [
        {'criterion': ['gini', 'entropy'],      # try different purity metrics in building the trees
         'max_depth': [2, 5, 8, 10, 15, 20],    # vary the max_depth of the trees in the ensemble
         'n_estimators': [10, 50, 100, 200],    # vary the number of trees in the ensemble
         'max_samples': [0.4, 0.7, 0.9]}        # vary how many samples each tree is built with
]

# setup the Random Forest model with all arguments as default
model = RandomForestClassifier()

# pass the model and the param_grid to the grid search, and use 5 folds with 'average_precision' as the scoring measure
grid_search = GridSearchCV(model, param_grid, cv = 5, scoring = 'average_precision')

# fit the grid search to the training set
grid_search.fit(X_smote, y_smote)

# return best model
rf_best = grid_search.best_estimator_

# return the hyperparameter values of the best model
print(grid_search.best_params_)

# use the best model to make predictions on the test set
y_pred = rf_best.predict(X_test)

# compute the test set accuracy of the best model
print("accuracy: ", accuracy_score(y_test,y_pred))
print("f1: ", f1_score(y_test, y_pred, pos_label='Listed'))
print("precision: ", precision_score(y_test, y_pred, pos_label='Listed'))
print("recall: ", recall_score(y_test, y_pred, pos_label='Listed'))

However, I get the following error:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\metrics\_ranking.py in average_precision_score(y_true, y_score, average, pos_label, sample_weight)
    211         if len(present_labels) == 2 and pos_label not in present_labels:
    212             raise ValueError("pos_label=%r is invalid. Set it to a label in "
--> 213                              "y_true." % pos_label)
    214     average_precision = partial(_binary_uninterpolated_average_precision,
    215                                 pos_label=pos_label)

ValueError: pos_label=1 is invalid. Set it to a label in y_true.

Why can't I use average_precision in my code the same way I used balanced_accuracy? Is there something else I need to do?

1 answer:

Answer 0 (score: 1)

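I don't know what your dataset looks like, or exactly where in your code the error occurs; the post contains a lot of code that is not needed to reproduce the problem.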

If the goal is to use the average precision score as stated, you can use make_scorer, assuming your labels are binary 0/1, as in the example below:

from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = [
        {'criterion': ['gini', 'entropy'],
         'max_depth': [2, 5],
         'n_estimators': [200],
         'max_samples': [0.8]}]


X, y = make_blobs(n_samples=[80, 20], centers=None, n_features=5,
                  cluster_std=3.5, random_state=0)

model = RandomForestClassifier(random_state=42)
grid_search_acc = GridSearchCV(model, param_grid, cv = 5, scoring = 'balanced_accuracy')

grid_search_acc.fit(X, y)

grid_search_acc.best_score_
0.75625

Balanced accuracy works; adapting it to average precision:

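from sklearn.metrics import average_precision_score, make_scorer, precision_score
# note: this wraps precision_score (precision of the hard predictions), not average_precision_score
ap_score = make_scorer(precision_score, greater_is_better=True, pos_label=1)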

grid_search_prec = GridSearchCV(model, param_grid, cv = 5, scoring = ap_score)
grid_search_prec.fit(X, y)

grid_search_prec.best_score_
0.9333333333333332
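
The metric calls in the question pass pos_label='Listed', which suggests the labels are strings, while the traceback shows average_precision_score falling back to pos_label=1. A minimal sketch of one way around this is a plain scorer callable that passes the positive class explicitly and scores the predicted probabilities. The name ap_scorer, the negative label 'Unlisted', and the small grid are assumptions made for illustration; only 'Listed' comes from the question.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import GridSearchCV

# 'Listed' is taken from the question; 'Unlisted' is a made-up
# negative label used only for this illustration.
POS_LABEL = 'Listed'
NEG_LABEL = 'Unlisted'


def ap_scorer(estimator, X, y):
    """Average precision computed from positive-class probabilities."""
    # Locate the predict_proba column that belongs to the positive class.
    pos_col = list(estimator.classes_).index(POS_LABEL)
    y_score = estimator.predict_proba(X)[:, pos_col]
    # Pass pos_label explicitly so the default of 1 is never used.
    return average_precision_score(y, y_score, pos_label=POS_LABEL)


# Same synthetic data as in the answer, relabelled with strings.
X, y_int = make_blobs(n_samples=[80, 20], centers=None, n_features=5,
                      cluster_std=3.5, random_state=0)
y = np.where(y_int == 1, POS_LABEL, NEG_LABEL)

# Deliberately small grid so the sketch runs quickly.
param_grid = {'max_depth': [2, 5], 'n_estimators': [50]}
model = RandomForestClassifier(random_state=42)

# GridSearchCV accepts any callable with the signature (estimator, X, y).
grid_search_ap = GridSearchCV(model, param_grid, cv=5, scoring=ap_scorer)
grid_search_ap.fit(X, y)

best_ap = grid_search_ap.best_score_
print(grid_search_ap.best_params_, best_ap)
```

An equivalent scorer can probably be built with make_scorer, but the argument that asks for continuous scores instead of hard predictions has changed between scikit-learn releases, so the plain callable avoids depending on it. Another option, if it fits the data, is to encode the labels as 0/1 with the positive class as 1, in which case scoring = 'average_precision' should work unchanged.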