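How to get the average score of K-fold cross-validation using sklearn

Date: 2017-11-13 05:59:51

Tags: scikit-learn cross-validation

I am applying a decision tree with K-fold cross-validation using sklearn. Can someone help me display its average score? Here is my code: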

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix,classification_report

dta=pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data")

X=dta.drop("whether he/she donated blood in March 2007",axis=1)

X=X.values # convert dataframe to numpy array

y=dta["whether he/she donated blood in March 2007"]

y=y.values # convert dataframe to numpy array

kf = KFold(n_splits=10)  # defaults: shuffle=False, random_state=None

clf_tree=DecisionTreeClassifier()

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf=clf_tree.fit(X_train,y_train)
    print("classification_report_tree", 
           classification_report(y_test,clf_tree.predict(X_test)))

2 Answers:

Answer 0 (score: 2)

If you only want accuracy, you can simply use cross_val_score():
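import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

kf = KFold(n_splits=10)
clf_tree = DecisionTreeClassifier()
# One accuracy score per fold (accuracy is the classifier's default scorer)
scores = cross_val_score(clf_tree, X, y, cv=kf)

avg_score = np.mean(scores)  # average over the 10 folds
print(avg_score)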

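Here cross_val_score takes the original X and y as input (not split into train and test). It automatically splits them into training and test sets, fits the model on the training data, and scores it on the test data. These scores are returned in the scores variable.

So with 10 folds, scores will contain 10 values, one per fold. You can then average them to get the mean score.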
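To make the behaviour described above concrete, here is a minimal sketch comparing cross_val_score with an explicit loop over the same folds. It uses synthetic data from make_classification as a stand-in for transfusion.data (an assumption, so the snippet runs without a download) and fixes random_state=0 on the tree so both paths build identical trees; neither choice comes from the original answer.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; the question uses transfusion.data)
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

kf = KFold(n_splits=10)
# random_state is fixed only so the two approaches are directly comparable
clf_tree = DecisionTreeClassifier(random_state=0)

# Approach 1: let cross_val_score split, fit and score
scores = cross_val_score(clf_tree, X, y, cv=kf)

# Approach 2: the same steps written out by hand
manual_scores = []
for train_index, test_index in kf.split(X):
    model = clone(clf_tree)                     # fresh, unfitted copy per fold
    model.fit(X[train_index], y[train_index])   # fit on the training part
    manual_scores.append(model.score(X[test_index], y[test_index]))  # accuracy

print("cross_val_score mean:", scores.mean())
print("manual loop mean:    ", np.mean(manual_scores))
```

Both approaches should report the same mean accuracy here, since the folds and the tree's random_state are identical.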

Answer 1 (score: 1)

You can try the precision_recall_fscore_support metric from sklearn and then average the results of each fold per class. I am assuming here that you need the average scores per class.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV,cross_val_score

dta=pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data")

X=dta.drop("whether he/she donated blood in March 2007",axis=1)

X=X.values # convert dataframe to numpy array

y=dta["whether he/she donated blood in March 2007"]

y=y.values # convert dataframe to numpy array

kf = KFold(n_splits=10)  # defaults: shuffle=False, random_state=None

clf_tree=DecisionTreeClassifier()

score_array =[]
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf=clf_tree.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
    score_array.append(precision_recall_fscore_support(y_test, y_pred, average=None))

avg_score = np.mean(score_array,axis=0)
print(avg_score)

#Output:
#[[  0.77302466   0.30042282]
# [  0.81755068   0.22192344]
# [  0.79063779   0.24414489]
# [ 57.          17.8       ]]

Now, to get the precision of class 0, use avg_score[0][0]. Recall is in the second row (for class 0, avg_score[1][0]), while the F-score and support are in the third and fourth rows respectively. Note that the support row is averaged as well, so it holds the mean number of test samples per class per fold, not the total.
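As a small illustration of that row layout, the averaged array can be unpacked into named rows instead of being indexed by position. This is only a sketch: it uses synthetic data from make_classification in place of transfusion.data, and the labels=[0, 1] and zero_division=0 arguments (the latter needs scikit-learn 0.22 or later) are additions not in the original answer, included so every fold returns an array of the same shape without warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; the answer uses transfusion.data)
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

kf = KFold(n_splits=10)
score_array = []
for train_index, test_index in kf.split(X):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_index], y[train_index])
    y_pred = clf.predict(X[test_index])
    # Returns (precision, recall, fscore, support), each with one entry per class
    score_array.append(precision_recall_fscore_support(
        y[test_index], y_pred, average=None, labels=[0, 1], zero_division=0))

# Shape (4, 2): rows are the metrics, columns are the classes
avg_score = np.mean(score_array, axis=0)
precision, recall, fscore, support = avg_score

for label in (0, 1):
    print(f"class {label}: precision={precision[label]:.3f} "
          f"recall={recall[label]:.3f} fscore={fscore[label]:.3f} "
          f"mean support={support[label]:.1f}")
```

Unpacking by name avoids having to remember which row index corresponds to which metric.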