Passing a scoring function from sklearn.metrics to GridSearchCV

Time: 2016-08-04 05:12:21

Tags: python scikit-learn grid-search

GridSearchCV's documentation states that I can pass a scoring function.

  

scoring : string, callable or None, default=None

I would like to use the native accuracy_score as the scoring function.

So here is my attempt. Imports and some data:

import numpy as np
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn import neighbors

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([0, 1, 0, 0, 0, 1])

Now, when I use k-fold cross-validation without a scoring function, everything works as expected:

parameters = {
    'n_neighbors': [2, 3, 4],
    'weights':['uniform', 'distance'],
    'p': [1, 2, 3]
}
model = neighbors.KNeighborsClassifier()
k_fold = KFold(len(Y), n_folds=6, shuffle=True, random_state=0)
clf = GridSearchCV(model, parameters, cv=k_fold)  # TODO will change
clf.fit(X, Y)

print clf.best_score_

But when I change the line to

clf = GridSearchCV(model, parameters, cv=k_fold, scoring=accuracy_score) # or accuracy_score()

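I get the error: ValueError: Cannot have number of folds n_folds=10 greater than the number of samples: 6. As far as I can tell, this message does not point to the real problem.

I think the real problem is that accuracy_score does not follow the signature scorer(estimator, X, y) that the documentation requires: it takes true and predicted labels, not an estimator and data.

So how can I fix this?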

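2 Answers:

Answer 0: (score: 3)

If you change scoring=accuracy_score to scoring='accuracy', it works (see the docs at http://scikit-learn.org/stable/modules/model_evaluation.html for the full list of scorers you can refer to by name this way).

In theory, you should be able to pass a custom scoring function the way you tried, but I guess you are right that accuracy_score does not have the right API.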
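For reference, here is a minimal sketch of that change applied to the data from the question. It assumes a newer scikit-learn release in which GridSearchCV and KFold live in sklearn.model_selection and KFold takes n_splits instead of the sample count and n_folds; on the older release used in the question, only the scoring argument needs to change.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

# Same toy data as in the question
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([0, 1, 0, 0, 0, 1])

parameters = {
    'n_neighbors': [2, 3, 4],
    'weights': ['uniform', 'distance'],
    'p': [1, 2, 3],
}

# Assumption: newer KFold API (n_splits), not the n_folds API from the question
k_fold = KFold(n_splits=6, shuffle=True, random_state=0)

# Refer to the built-in accuracy scorer by name instead of passing accuracy_score
clf = GridSearchCV(KNeighborsClassifier(), parameters, cv=k_fold, scoring='accuracy')
clf.fit(X, Y)

print(clf.best_score_)
```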
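As a sketch of how a metric can be adapted to the scorer(estimator, X, y) signature, the snippet below shows two options that I would expect to behave like scoring='accuracy': wrapping the metric with make_scorer, or writing the callable by hand. The names wrapped_scorer and manual_scorer are made up for illustration, and the code assumes a scikit-learn release that provides sklearn.model_selection.

```python
import numpy as np
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Same toy data as in the question
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([0, 1, 0, 0, 0, 1])

# Option 1: let make_scorer turn metric(y_true, y_pred) into a scorer
wrapped_scorer = make_scorer(accuracy_score)


# Option 2: a hand-written callable with the scorer(estimator, X, y) signature
def manual_scorer(estimator, X, y):
    return accuracy_score(y, estimator.predict(X))


model = KNeighborsClassifier(n_neighbors=3)
k_fold = KFold(n_splits=6, shuffle=True, random_state=0)

scores_by_name = cross_val_score(model, X, Y, cv=k_fold, scoring='accuracy')
scores_wrapped = cross_val_score(model, X, Y, cv=k_fold, scoring=wrapped_scorer)
scores_manual = cross_val_score(model, X, Y, cv=k_fold, scoring=manual_scorer)

print(scores_by_name, scores_wrapped, scores_manual)
```

Either object can be passed as the scoring argument of GridSearchCV in the same way.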

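Answer 1: (score: 0)

Here is an example that uses weighted kappa as the scoring metric for GridSearchCV with a simple random forest model. The key lesson for me was that parameters specific to the metric are passed through the make_scorer function.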

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import cohen_kappa_score, make_scorer


kappa_scorer = make_scorer(cohen_kappa_score,weights="quadratic")
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_features': range(2, 10),  # try 2 to 9 features (range excludes 10)
    'min_samples_leaf': [3, 4, 5],
    'n_estimators' : [100,300,500],
    'max_depth':  [5]
    }
# Create a base model
random_forest = RandomForestClassifier(class_weight ="balanced_subsample",random_state=1)
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=random_forest, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2, scoring=kappa_scorer)  # search for the best model using weighted kappa

# Fit the grid search to the data
# (final_tr and yTrain are my training features and labels; they are not defined in this snippet)
grid_search.fit(final_tr, yTrain)
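To illustrate what make_scorer does with the extra weights argument, here is a small self-contained sketch. The data is synthetic and the names X_demo, y_demo and forest are invented for this illustration; they are not the final_tr and yTrain used above. It checks that calling the scorer on a fitted model gives the same value as calling cohen_kappa_score directly with weights="quadratic".

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer

# Synthetic stand-in data (assumption: three ordinal classes, four features)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 4))
y_demo = rng.randint(0, 3, size=60)

forest = RandomForestClassifier(n_estimators=20, random_state=1).fit(X_demo, y_demo)

# Extra keyword arguments given to make_scorer are forwarded to the metric
kappa_scorer = make_scorer(cohen_kappa_score, weights="quadratic")

# A scorer is called as scorer(estimator, X, y) ...
via_scorer = kappa_scorer(forest, X_demo, y_demo)
# ... which should match calling the metric on the predictions directly
direct = cohen_kappa_score(y_demo, forest.predict(X_demo), weights="quadratic")

print(via_scorer, direct)
```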