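Getting a report of the cross-validation splits at the end

Time: 2019-05-25 22:24:31

Tags: python machine-learning

I am trying to convert my nested CV pipeline into a format like the one in @Heavy Breathing's question (Reference link). His code is: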

outer_loop_accuracy_scores = []
inner_loop_won_params = []
inner_loop_accuracy_scores = []

# Looping through the outer loop, feeding each training set into a GSCV as the inner loop
for train_index,test_index in outer_kf.split(features,target):

    GSCV = GridSearchCV(estimator=model,param_grid=params,cv=inner_kf)

    # GSCV is looping through the training data to find the best parameters. This is the inner loop
    GSCV.fit(features[train_index],target[train_index])

    # The best hyperparameters found by GSCV are now tested on the unseen outer-loop test data.
    pred = GSCV.predict(features[test_index])

    # Appending the "winning" hyper parameters and their associated accuracy score
    inner_loop_won_params.append(GSCV.best_params_)
    outer_loop_accuracy_scores.append(accuracy_score(target[test_index],pred))
    inner_loop_accuracy_scores.append(GSCV.best_score_)

for i in zip(inner_loop_won_params,outer_loop_accuracy_scores,inner_loop_accuracy_scores):
    print(i)

print('Mean of outer loop accuracy score:',np.mean(outer_loop_accuracy_scores))

However, I use a nested CV approach in which OneHotEncoder is integrated through a pipeline:

params = {
        'max_depth':np.linspace(1, 32, 8, endpoint=True),
        'n_estimators': [1, 2, 4, 8, 16, 32, 64, 200],
        'min_samples_split': np.linspace(0.1, 1.0, 5, endpoint=True),
        'min_samples_leaf':np.linspace(0.1, 0.5, 5, endpoint=True)
        }
rf = RandomForestClassifier(random_state = 23)    
grid_search = GridSearchCV(rf, params, cv=3, verbose=10)

pipe = make_pipeline(OneHotEncoder(sparse = True, handle_unknown='ignore'), grid_search)

cv = StratifiedKFold(n_splits = 5, random_state = 23, shuffle = False)

roc_auc = cross_val_score(pipe, df_x, df_y.values.ravel(), scoring = 'roc_auc', cv=cv)
accuracy = cross_val_score(pipe, df_x, df_y.values.ravel(), scoring = 'accuracy', cv=cv)


print("---%0.1f minutes---" %((timeit.default_timer()-start_time)/60))
print("roc_auc = {}, accuracy = {}".format(np.mean(roc_auc),np.mean(accuracy)))

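Now I want to convert this into the structure above to get more detail about the splits: lists of the roc_auc and accuracy for each outer split, plus lists of best_params_ and best_score_ from the inner GridSearchCV.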
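One possible way to get these per-split details while keeping the pipeline is `cross_validate` with `return_estimator=True`, which returns the fitted pipeline for each outer split. The following is only a sketch under assumptions: the toy categorical data stands in for `df_x` / `df_y`, the parameter grid is shrunk so it runs quickly, the folds use `shuffle=True`, and the encoder's `sparse` argument is left at its default.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in data: two categorical features and a noisy binary target.
rng = np.random.RandomState(0)
df_x = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=120),
    "size": rng.choice(["S", "M", "L"], size=120),
})
base = (df_x["color"] == "red").astype(int).to_numpy()
flip = rng.rand(120) < 0.1
y = np.where(flip, 1 - base, base)

# Reduced grid (assumption) so the sketch finishes in seconds.
params = {"max_depth": [2, 4], "n_estimators": [10, 20]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=23), params, cv=3)
pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"), grid_search)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)

# One call computes both metrics and keeps the fitted pipeline of every outer split.
results = cross_validate(
    pipe, df_x, y,
    scoring=["roc_auc", "accuracy"],
    cv=cv,
    return_estimator=True,
)

for fold, est in enumerate(results["estimator"]):
    # make_pipeline names each step after its lowercased class name.
    inner = est.named_steps["gridsearchcv"]
    print(fold, inner.best_params_, inner.best_score_,
          results["test_roc_auc"][fold], results["test_accuracy"][fold])

print("Mean outer roc_auc:", np.mean(results["test_roc_auc"]))
print("Mean outer accuracy:", np.mean(results["test_accuracy"]))
```

Compared with calling `cross_val_score` twice, this fits the nested search only once per outer split, so both metrics come from the same fitted models.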

I used the code below, but got the following error: "[Int64Index([78, 79, 80, 82, 83, 84, 85, 86, 87, 88,\n ...\n 422, 423, 424, 425, 426, 427, 428, 429, 430, 431],\n dtype='int64', length=344)] are in the [columns]"

rf = RandomForestClassifier(random_state = 23)
ohe = OneHotEncoder(sparse=True, handle_unknown = 'ignore')
outer_kf = StratifiedKFold(n_splits=5,shuffle=False,random_state=1)
inner_kf = StratifiedKFold(n_splits=3,shuffle=False,random_state=2)


params = {
        'max_depth':np.linspace(1, 32, 8, endpoint=True),
        'n_estimators': [1, 2, 4, 8, 16, 32, 64, 200],
        'min_samples_split': np.linspace(0.1, 1.0, 5, endpoint=True),
        'min_samples_leaf':np.linspace(0.1, 0.5, 5, endpoint=True)
        }

outer_loop_accuracy_scores = []
outer_loop_roc_auc_scores = []
inner_loop_won_params = []
inner_loop_accuracy_scores = []
for train_index,test_index in outer_kf.split(df_x,df_y.values.ravel()):

    GSCV = GridSearchCV(estimator=rf,param_grid=params,cv=inner_kf, verbose = 0)
    ohe_x = ohe.fit_transform(df_x[train_index]).toarray()
    ohe_x_test = ohe.transform(df_x[test_index]).toarray()

    # GSCV is looping through the training data to find the best parameters. This is the inner loop
    GSCV.fit(ohe_x,df_y[train_index])

    # The best hyperparameters found by GSCV are now tested on the unseen outer-loop test data.
    pred = GSCV.predict(ohe_x_test)

    # Appending the "winning" hyper parameters and their associated accuracy score
    inner_loop_won_params.append(GSCV.best_params_)
    outer_loop_accuracy_scores.append(accuracy_score(df_y[test_index],pred))
    outer_loop_roc_auc_scores.append(roc_auc_score(df_y[test_index],pred))
    inner_loop_accuracy_scores.append(GSCV.best_score_)

for i in zip(inner_loop_won_params,outer_loop_roc_auc_scores,outer_loop_accuracy_scores,inner_loop_accuracy_scores):
    print(i)

print('Mean of outer loop accuracy score:',np.mean(outer_loop_accuracy_scores))
print('Mean of outer loop roc_auc score:',np.mean(outer_loop_roc_auc_scores))
print("---%0.1f minutes---" %((timeit.default_timer()-start_time)/60))

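How can I integrate my structure with this code? Can I use the pipeline inside the loop, or should I discard the pipeline? How should I use OneHotEncoder? I am getting the column error, which I assume is caused by different feature values appearing.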
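One way the pipeline could be kept inside the explicit loop is to pass the pipeline itself to GridSearchCV as the estimator, so the encoder is refit on each training portion. This is a sketch, not a definitive answer. It assumes toy data in place of `df_x` / `df_y`, a reduced grid, `shuffle=True` folds, and ROC AUC computed from `predict_proba` rather than from hard predictions. Note that this places the encoder inside the inner CV as well, which differs from the pipeline in the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in data: two categorical features and a noisy binary target.
rng = np.random.RandomState(0)
df_x = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=120),
    "size": rng.choice(["S", "M", "L"], size=120),
})
base = (df_x["color"] == "red").astype(int).to_numpy()
y = np.where(rng.rand(120) < 0.1, 1 - base, base)

# Encoder and classifier live in one pipeline; grid keys carry the step-name prefix.
pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                     RandomForestClassifier(random_state=23))
params = {  # reduced grid (assumption) to keep the run short
    "randomforestclassifier__max_depth": [2, 4],
    "randomforestclassifier__n_estimators": [10, 20],
}

outer_kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
inner_kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=2)

report = []
for train_index, test_index in outer_kf.split(df_x, y):
    gscv = GridSearchCV(estimator=pipe, param_grid=params, cv=inner_kf)

    # .iloc selects rows by position; plain df_x[train_index] would look up columns.
    x_train, x_test = df_x.iloc[train_index], df_x.iloc[test_index]
    gscv.fit(x_train, y[train_index])

    pred = gscv.predict(x_test)
    proba = gscv.predict_proba(x_test)[:, 1]  # probability of the positive class

    report.append({
        "best_params": gscv.best_params_,
        "inner_best_score": gscv.best_score_,
        "outer_accuracy": accuracy_score(y[test_index], pred),
        "outer_roc_auc": roc_auc_score(y[test_index], proba),
    })

for row in report:
    print(row)

print("Mean of outer loop accuracy score:", np.mean([r["outer_accuracy"] for r in report]))
print("Mean of outer loop roc_auc score:", np.mean([r["outer_roc_auc"] for r in report]))
```

Because `handle_unknown="ignore"` is set, categories that appear only in a test fold are encoded as all zeros instead of raising an error.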

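***It seems the problem occurred because df_x is a DataFrame in my case. I switched to df_x.values and df_y.values.ravel(), and now it works.

I have now identified the error and the code runs, but I still have the same questions about the logic and the pipeline.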
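A small sketch of why this happens, using a hypothetical four-row DataFrame: indexing a DataFrame with `df[...]` and an integer array looks up column labels, while `.iloc` and `.values` select rows by position.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
df = pd.DataFrame({"a": ["x", "y", "z", "x"], "b": ["p", "q", "p", "q"]})
idx = np.array([0, 2])  # positional indices, like those returned by KFold.split

# df[idx] treats 0 and 2 as column labels, which do not exist here.
try:
    df[idx]
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)

rows_iloc = df.iloc[idx]      # positional row selection, stays a DataFrame
rows_values = df.values[idx]  # positional row selection on the NumPy array

print(rows_iloc)
print(rows_values)
```

Using `.iloc` keeps the column names, which can be convenient when the encoder or later steps rely on them.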

0 answers:

No answers yet