Python Naive Bayes with cross-validation using the GaussianNB classifier

Date: 2018-07-05 15:05:24

Tags: python scikit-learn

I want to apply Naive Bayes with 10-fold stratified cross-validation to my data, and then see how the model performs on the test data I set aside at the start. However, the results I get (i.e. the predictions and probability values y_pred_nb2 and y_scores_nb2) are identical to the ones I get when I run the code without any cross-validation. Question: how do I correct this?

Here is the code, where X_train makes up 75% of the whole dataset and X_test the remaining 25%.

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import GaussianNB

params = {}

#gridsearch searches for the best hyperparameters and keeps the classifier with the highest recall score
skf = StratifiedKFold(n_splits=10)

nb2 = GridSearchCV(GaussianNB(), cv=skf, param_grid=params)
%time nb2.fit(X_train, y_train)

# predict values on the test set
y_pred_nb2 = nb2.predict(X_test) 

print(y_pred_nb2)

# predicted probabilities on the test set
y_scores_nb2 = nb2.predict_proba(X_test)[:, 1]

print(y_scores_nb2)

2 Answers:

Answer 0 (score: 1)

First, GaussianNB only accepts priors as a parameter, so unless you have priors to set for the model in advance, you have nothing to grid search over.
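(As a side note, newer scikit-learn releases, 0.20 and later, also expose a var_smoothing parameter on GaussianNB, which does give a grid search something to tune. A minimal sketch on the built-in iris data:)

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# a non-empty grid over var_smoothing (scikit-learn >= 0.20)
gs = GridSearchCV(GaussianNB(),
                  param_grid={"var_smoothing": [1e-9, 1e-7, 1e-5]},
                  cv=StratifiedKFold(n_splits=10))
gs.fit(X, y)

print(gs.best_params_)  # the var_smoothing value with the best mean CV score
```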

Furthermore, your param_grid is set to an empty dictionary, which guarantees that you fit only one estimator with GridSearchCV. This is the same as fitting the estimator without a grid search. For example (I use MultinomialNB to show the use of hyperparameters):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB

skf = StratifiedKFold(n_splits=10)
params = {}
nb = MultinomialNB()
gs = GridSearchCV(nb, cv=skf, param_grid=params, return_train_score=True)

data = load_iris()
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target)

gs.fit(x_train, y_train)

gs.cv_results_
{'mean_fit_time': array([0.]),
 'mean_score_time': array([0.]),
 'mean_test_score': array([0.85714286]),
 'mean_train_score': array([0.85992157]),
 'params': [{}],
 'rank_test_score': array([1]),
 'split0_test_score': array([0.91666667]),
 'split0_train_score': array([0.84]),
 'split1_test_score': array([0.75]),
 'split1_train_score': array([0.86]),
 'split2_test_score': array([0.83333333]),
 'split2_train_score': array([0.84]),
 'split3_test_score': array([0.91666667]),
 'split3_train_score': array([0.83]),
 'split4_test_score': array([0.83333333]),
 'split4_train_score': array([0.85]),
 'split5_test_score': array([0.91666667]),
 'split5_train_score': array([0.84]),
 'split6_test_score': array([0.9]),
 'split6_train_score': array([0.88235294]),
 'split7_test_score': array([0.8]),
 'split7_train_score': array([0.88235294]),
 'split8_test_score': array([0.8]),
 'split8_train_score': array([0.89215686]),
 'split9_test_score': array([0.9]),
 'split9_train_score': array([0.88235294]),
 'std_fit_time': array([0.]),
 'std_score_time': array([0.]),
 'std_test_score': array([0.05832118]),
 'std_train_score': array([0.02175538])}

nb.fit(x_train, y_train)
nb.score(x_test, y_test)
0.8157894736842105

gs.score(x_test, y_test)
0.8157894736842105

gs.param_grid = {'alpha': [0.1, 2]}
gs.fit(x_train, y_train)
gs.score(x_test, y_test)
0.8421052631578947

gs.cv_results_
{'mean_fit_time': array([0.00090394, 0.00049713]),
 'mean_score_time': array([0.00029924, 0.0003005 ]),
 'mean_test_score': array([0.86607143, 0.85714286]),
 'mean_train_score': array([0.86092157, 0.85494118]),
 'param_alpha': masked_array(data=[0.1, 2],
              mask=[False, False],
        fill_value='?',
             dtype=object),
 'params': [{'alpha': 0.1}, {'alpha': 2}],
 'rank_test_score': array([1, 2]),
 'split0_test_score': array([0.91666667, 0.91666667]),
 'split0_train_score': array([0.84, 0.83]),
 'split1_test_score': array([0.75, 0.75]),
 'split1_train_score': array([0.86, 0.86]),
 'split2_test_score': array([0.83333333, 0.83333333]),
 'split2_train_score': array([0.85, 0.84]),
 'split3_test_score': array([0.91666667, 0.91666667]),
 'split3_train_score': array([0.83, 0.81]),
 'split4_test_score': array([0.83333333, 0.83333333]),
 'split4_train_score': array([0.85, 0.84]),
 'split5_test_score': array([0.91666667, 0.91666667]),
 'split5_train_score': array([0.84, 0.84]),
 'split6_test_score': array([0.9, 0.9]),
 'split6_train_score': array([0.88235294, 0.88235294]),
 'split7_test_score': array([0.9, 0.8]),
 'split7_train_score': array([0.88235294, 0.88235294]),
 'split8_test_score': array([0.8, 0.8]),
 'split8_train_score': array([0.89215686, 0.89215686]),
 'split9_test_score': array([0.9, 0.9]),
 'split9_train_score': array([0.88235294, 0.87254902]),
 'std_fit_time': array([0.00030147, 0.00049713]),
 'std_score_time': array([0.00045711, 0.00045921]),
 'std_test_score': array([0.05651628, 0.05832118]),
 'std_train_score': array([0.02103457, 0.02556351])}
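Once a non-empty grid is supplied, the winning setting and its score can be read directly off the fitted search object. A short sketch on the same iris setup (the random_state is added here so the split is reproducible; which alpha wins may vary with the split):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB

data = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

gs = GridSearchCV(MultinomialNB(), param_grid={"alpha": [0.1, 2]},
                  cv=StratifiedKFold(n_splits=10))
gs.fit(x_train, y_train)

print(gs.best_params_)  # the alpha that won the cross-validation
print(gs.best_score_)   # its mean CV accuracy
# gs.predict / gs.score delegate to the estimator refitted with best_params_
print(gs.best_estimator_.score(x_test, y_test))
```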

Answer 1 (score: 0)

How about something like this?

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

#because only var_smoothing can be 'tuned'
#do a cross validation on different var_smoothing values

def cross_val(params):
    model = GaussianNB()
    model.set_params(**params)
    cv_results = cross_val_score(model, X_train, y_train,
                             cv = 10, #10 folds
                             scoring = "accuracy",
                             verbose = 2
                            )
    #return the mean of the 10 fold cross validation
    return cv_results.mean()

#baseline parameters
params = {
    "priors": None,  # None (the Python value), not the string "None"
    "var_smoothing": 1e-9
}
#create a list of var_smoothing values to cross validate
steps = [1e-8, 1e-7, 1e-6, 1e-5, 1e-4]

#will contain the cv results
results = []
for step in steps:        
    params["var_smoothing"] = step        
    cv_result = cross_val(params)

    #save result
    results.append(cv_result)

#print results
#convert results to pandas dataframe for easier visualization
df = pd.DataFrame({"var_smoothing" : steps, "accuracy" : results})
#sort it
df_sorted = df.sort_values("accuracy", ascending=False)
#reset the index of the sorted dataframe
df_sorted.reset_index(inplace=True, drop=True)
df_sorted.head()
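A possible follow-up (a sketch, using iris as stand-in data since the original X_train/y_train are not shown here): once the sorted table identifies the best var_smoothing, refit GaussianNB with it on the full training split before scoring the held-out test set.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# stand-in for the original 75/25 split
X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), random_state=0)

steps = [1e-8, 1e-7, 1e-6, 1e-5, 1e-4]
results = [cross_val_score(GaussianNB(var_smoothing=s),
                           X_train, y_train, cv=10).mean()
           for s in steps]

# pick the winner from the sorted table, then refit on the full training split
df_sorted = pd.DataFrame({"var_smoothing": steps, "accuracy": results}) \
              .sort_values("accuracy", ascending=False)
best = df_sorted.iloc[0]["var_smoothing"]
final_model = GaussianNB(var_smoothing=best).fit(X_train, y_train)
print(final_model.score(X_test, y_test))
```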