I'm classifying a binary problem with scikit-learn. I get a perfect classification_report (all 1's), yet the prediction gives 0.36. How can that be?

I'm familiar with label imbalance, but I don't think that is the case here, since the f1 and the other score columns, as well as the confusion matrix, all indicate perfect scores.
import pandas as pd
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     StratifiedKFold)
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Set aside the last 19 rows for prediction.
X1, X_Pred, y1, y_Pred = train_test_split(X, y, test_size=19,
                                          shuffle=False, random_state=None)
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.4,
                                                    stratify=y1, random_state=11)

clcv = DecisionTreeClassifier()
scorecv = cross_val_score(clcv, X1, y1, cv=StratifiedKFold(n_splits=4),
                          scoring='f1')  # f1 to balance precision/recall

clcv.fit(X1, y1)
y_predict = clcv.predict(X1)

cm = confusion_matrix(y1, y_predict)
cm_df = pd.DataFrame(cm, index=['0', '1'], columns=['0', '1'])
print(cm_df)
print(classification_report(y1, y_predict))
print('Prediction score:', clcv.score(X_Pred, y_Pred))  # unseen data
Output:
confusion:
      0   1
0  3011   0
1     0  44

              precision    recall  f1-score   support

       False       1.00      1.00      1.00      3011
        True       1.00      1.00      1.00        44

   micro avg       1.00      1.00      1.00      3055
   macro avg       1.00      1.00      1.00      3055
weighted avg       1.00      1.00      1.00      3055
Prediction score: 0.36
Answer (score: 2)
The problem is that you are overfitting.

There is a lot of unused code here, so let's trim it down:
# Set aside the last 19 rows for prediction.
X1, X_Pred, y1, y_Pred = train_test_split(X, y, test_size=19,
                                          shuffle=False, random_state=None)

clcv = DecisionTreeClassifier()
clcv.fit(X1, y1)
y_predict = clcv.predict(X1)  # predictions on the same data the tree was fit on

cm = confusion_matrix(y1, y_predict)
cm_df = pd.DataFrame(cm, index=['0', '1'], columns=['0', '1'])
print(cm_df)
print(classification_report(y1, y_predict))
print('Prediction score:', clcv.score(X_Pred, y_Pred))  # unseen data

(Note: the report and confusion matrix must use y_predict, not y_Pred; comparing y1 against the 19-row y_Pred would raise a length-mismatch error.)
It is clear that there is no cross-validation here: the confusion matrix and the report are computed on the very data the tree was fitted on, and an unconstrained decision tree can memorize that data perfectly. The low prediction score on the held-out rows is the obvious symptom of overfitting by the decision tree classifier. Had you looked at the scores from cross-validation, you would have seen the problem right away.
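A minimal sketch of that check, reusing the names from the question (X and y are assumed to be the full feature matrix and label vector):

from sklearn.model_selection import (train_test_split, cross_val_score,
                                     StratifiedKFold)
from sklearn.tree import DecisionTreeClassifier

# Hold out the last 19 rows, as in the question.
X1, X_Pred, y1, y_Pred = train_test_split(X, y, test_size=19,
                                          shuffle=False, random_state=None)

clcv = DecisionTreeClassifier()

# Each fold is scored on data the tree never saw during fitting,
# so these f1 values reflect real generalization.
cv_scores = cross_val_score(clcv, X1, y1,
                            cv=StratifiedKFold(n_splits=4), scoring='f1')
print('CV f1 per fold:', cv_scores)
print('CV f1 mean:', cv_scores.mean())

# Training score vs. held-out score: a large gap means overfitting.
clcv.fit(X1, y1)
print('Training accuracy:', clcv.score(X1, y1))           # ~1.0, memorized
print('Held-out accuracy:', clcv.score(X_Pred, y_Pred))   # the 0.36 figure

If the cross-validated f1 sits far below the training score, the tree is memorizing X1 rather than learning a general pattern; constraining it (for example with max_depth or min_samples_leaf) typically narrows the gap.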