Perfect precision, recall and F1 score, but wrong predictions

Date: 2018-11-13 10:05:58

Tags: python scikit-learn prediction

I am classifying a binary problem with scikit-learn. The classification_report comes back perfect (all 1s), yet the prediction score is 0.36. How can that be?

I am familiar with label imbalance, but I don't think that is the case here, because the f1 column, the other score columns, and the confusion matrix all report perfect scores.

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Set aside the last 19 rows for prediction.
X1, X_Pred, y1, y_Pred = train_test_split(X, y, test_size=19,
                                          shuffle=False, random_state=None)

X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.4,
                                                    stratify=y1, random_state=11)

clcv = DecisionTreeClassifier()
scorecv = cross_val_score(clcv, X1, y1, cv=StratifiedKFold(n_splits=4),
                          scoring='f1')  # to balance precision/recall
clcv.fit(X1, y1)
y_predict = clcv.predict(X1)  # predictions on the same data the tree was fit on
cm = confusion_matrix(y1, y_predict)
cm_df = pd.DataFrame(cm, index=['0', '1'], columns=['0', '1'])
print(cm_df)
print(classification_report(y1, y_predict))
print('Prediction score:', clcv.score(X_Pred, y_Pred))  # unseen data

Output:

confusion:
      0   1
0  3011   0
1     0  44

              precision    recall  f1-score   support
       False       1.00      1.00      1.00      3011
        True       1.00      1.00      1.00        44

   micro avg       1.00      1.00      1.00      3055
   macro avg       1.00      1.00      1.00      3055
weighted avg       1.00      1.00      1.00      3055

Prediction score: 0.36

1 Answer:

Answer 0 (score: 2)

The problem is that you are overfitting.

There is a lot of unused code here, so let's trim it down:

# Set aside the last 19 rows for prediction.
X1, X_Pred, y1, y_Pred = train_test_split(X, y, test_size=19,
                                          shuffle=False, random_state=None)

clcv = DecisionTreeClassifier()
clcv.fit(X1, y1)
y_predict = clcv.predict(X1)  # predictions on the very data the tree was fit on
cm = confusion_matrix(y1, y_predict)
cm_df = pd.DataFrame(cm, index=['0', '1'], columns=['0', '1'])
print(cm_df)
print(classification_report(y1, y_predict))
print('Prediction score:', clcv.score(X_Pred, y_Pred))  # unseen data

It is obvious that no cross-validation is actually used here: the tree is fit on X1 and then evaluated on that same X1, so an unpruned decision tree simply memorizes the training data and the confusion matrix and classification report come out trivially perfect. The low prediction score on the held-out X_Pred is the telltale sign of the decision tree classifier overfitting.
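For illustration, a minimal sketch of the correct pattern, reusing the train/test split the question already creates but never uses (the exact numbers will depend on the data):

# Fit on the training portion only, then report metrics on held-out data.
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.4,
                                                    stratify=y1, random_state=11)
clcv = DecisionTreeClassifier()
clcv.fit(X_train, y_train)          # the tree never sees the test rows
y_test_pred = clcv.predict(X_test)  # hypothetical name for the held-out predictions
print(classification_report(y_test, y_test_pred))
print('Held-out accuracy:', clcv.score(X_test, y_test))

With this pattern the report is no longer trivially perfect, and the gap between the training and held-out scores makes the overfitting visible.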

Use the scores from cross-validation instead, and you will see the problem there directly.
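For example, the question's own cross_val_score call already computes exactly these scores but never prints them. A minimal sketch, with the question's variable names:

# Cross-validated F1: each fold is scored on data the tree was not fit on.
scorecv = cross_val_score(clcv, X1, y1, cv=StratifiedKFold(n_splits=4),
                          scoring='f1')
print('F1 per fold:', scorecv)
print('Mean F1:', scorecv.mean())

With an unpruned tree on data this imbalanced, the per-fold F1 should come out far below the in-sample 1.00, in line with the 0.36 prediction score.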