Cross-validation with f1-score and multiple parameters

Date: 2019-07-03 13:50:37

Tags: python scikit-learn

I am trying to use SelectKBest for feature selection and to find the best tree depth for binary classification, scored with f1-score. I created a scoring function to select the best features and to evaluate the grid search. When the classifier tries to fit the training data, the error "__call__() missing 1 required positional argument: 'y_true'" pops up.

#Define scorer
f1_scorer = make_scorer(f1_score)
#Split data into training, CV and test set
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state = 0)

#initialize tree and Select K-best features for classifier   
kbest = SelectKBest(score_func=f1_scorer, k=all)
clf = DecisionTreeClassifier(random_state=0)

#create a pipeline for features to be optimized
pipeline = Pipeline([('kbest',kbest),('dt',clf)])

#initialize a grid search with features to be optimized
gs = GridSearchCV(pipeline,{'kbest__k': range(2,11), 'dt__max_depth':range(3,7)}, refit=True, cv=5, scoring = f1_scorer)

gs.fit(X_train,y_train)

#order best selected features into a single variable
selector = SelectKBest(score_func=f1_scorer, k=gs.best_params_['kbest__k'])
X_new = selector.fit_transform(X_train,y_train)  

On the fit line I get a TypeError: __call__() missing 1 required positional argument: 'y_true'.

1 Answer:

Answer 0 (score: 0)

The problem is in the score_func you passed to SelectKBest. score_func must be a function that takes two arrays X and y and returns either a pair of arrays (scores, pvalues) or a single array of scores, but your code passes the callable f1_scorer as score_func, and f1_scorer only takes y_true and y_pred and computes the f1 score. For a classification task you can use one of chi2, f_classif, or mutual_info_classif as score_func. There is also a small bug in the k parameter of SelectKBest: it should be the string "all" rather than the built-in all. I modified the code with these changes,
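To see why that particular TypeError appears: make_scorer returns a callable whose signature is (estimator, X, y_true), while SelectKBest internally calls score_func(X, y) with only two positional arguments. A minimal sketch of the mismatch (the toy arrays here are illustrative, not from the question):

```python
import numpy as np
from sklearn.metrics import f1_score, make_scorer

f1_scorer = make_scorer(f1_score)

# toy arrays standing in for the X, y that SelectKBest passes along
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])
y = np.array([0, 1, 1, 0])

# SelectKBest invokes score_func(X, y); the scorer binds X to its
# `estimator` parameter and y to its `X` parameter, leaving `y_true`
# unfilled -- hence the TypeError from the question
try:
    f1_scorer(X, y)
except TypeError as err:
    print(type(err).__name__)  # -> TypeError
```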

from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score, make_scorer
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_classes=2,
                           n_informative=4, weights=[0.7, 0.3],
                           random_state=0)

f1_scorer = make_scorer(f1_score)
#Split data into training, CV and test set
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state = 0)

#initialize tree and Select K-best features for classifier   
kbest = SelectKBest(score_func=f_classif)
clf = DecisionTreeClassifier(random_state=0)

#create a pipeline for features to be optimized
pipeline = Pipeline([('kbest',kbest),('dt',clf)])
gs = GridSearchCV(pipeline,{'kbest__k': range(2,11), 'dt__max_depth':range(3,7)}, refit=True, cv=5, scoring = f1_scorer)
gs.fit(X_train,y_train)
gs.best_params_

Output

{'dt__max_depth': 6, 'kbest__k': 9}

Also modify your last two lines as follows:

selector = SelectKBest(score_func=f_classif, k=gs.best_params_['kbest__k'])
X_new = selector.fit_transform(X_train,y_train)  
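Since refit=True, the grid search already retrains the best pipeline on the full training split, so it can be scored directly on the held-out data. A sketch of that (not part of the original answer; it reuses the same synthetic data and pipeline as above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, make_scorer

X, y = make_classification(n_samples=1000, n_classes=2,
                           n_informative=4, weights=[0.7, 0.3],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

pipeline = Pipeline([('kbest', SelectKBest(score_func=f_classif)),
                     ('dt', DecisionTreeClassifier(random_state=0))])
gs = GridSearchCV(pipeline,
                  {'kbest__k': range(2, 11), 'dt__max_depth': range(3, 7)},
                  refit=True, cv=5, scoring=make_scorer(f1_score))
gs.fit(X_train, y_train)

# refit=True means gs.best_estimator_ is already fitted on X_train,
# so gs.predict delegates to it on the unseen test split
test_f1 = f1_score(y_test, gs.predict(X_test))
print(round(test_f1, 3))
```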

Hope this helps!