I am trying to convert my nested CV pipeline into a format like the one in @Heavy Breathing's question (reference link). His code reads:
outer_loop_accuracy_scores = []
inner_loop_won_params = []
inner_loop_accuracy_scores = []
# Looping through the outer loop, feeding each training set into a GSCV as the inner loop
for train_index,test_index in outer_kf.split(features,target):
    GSCV = GridSearchCV(estimator=model,param_grid=params,cv=inner_kf)
    # GSCV is looping through the training data to find the best parameters. This is the inner loop
    GSCV.fit(features[train_index],target[train_index])
    # The best hyper parameters from GSCV is now being tested on the unseen outer loop test data.
    pred = GSCV.predict(features[test_index])
    # Appending the "winning" hyper parameters and their associated accuracy score
    inner_loop_won_params.append(GSCV.best_params_)
    outer_loop_accuracy_scores.append(accuracy_score(target[test_index],pred))
    inner_loop_accuracy_scores.append(GSCV.best_score_)
for i in zip(inner_loop_won_params,outer_loop_accuracy_scores,inner_loop_accuracy_scores):
    print(i)
print('Mean of outer loop accuracy score:',np.mean(outer_loop_accuracy_scores))
My own code, however, does nested CV with a OneHotEncoder integrated into a pipeline:
params = {
    'max_depth': np.linspace(1, 32, 8, endpoint=True),
    'n_estimators': [1, 2, 4, 8, 16, 32, 64, 200],
    'min_samples_split': np.linspace(0.1, 1.0, 5, endpoint=True),
    'min_samples_leaf': np.linspace(0.1, 0.5, 5, endpoint=True)
}
rf = RandomForestClassifier(random_state = 23)
grid_search = GridSearchCV(rf, params, cv=3, verbose=10)
pipe = make_pipeline(OneHotEncoder(sparse = True, handle_unknown='ignore'), grid_search)
cv = StratifiedKFold(n_splits = 5, random_state = 23, shuffle = False)
roc_auc = cross_val_score(pipe, df_x, df_y.values.ravel(), scoring = 'roc_auc', cv=cv)
accuracy = cross_val_score(pipe, df_x, df_y.values.ravel(), scoring = 'accuracy', cv=cv)
print("---%0.1f minutes---" %((timeit.default_timer()-start_time)/60))
print("roc_auc = {}, accuracy = {}".format(np.mean(roc_auc),np.mean(accuracy)))
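Now I want to convert it to the structure above so that I get more detail about the splits: a list of the outer-split roc_auc and accuracy for each split, plus lists of best_params and best_score from the inner GridSearchCV.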
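One way to get those per-split details without giving up the pipeline might be `cross_validate` with `return_estimator=True`, which hands back the fitted pipeline for every outer split. The following is only a minimal sketch: the toy DataFrame, the target `y`, and the small grid `small_params` are hypothetical stand-ins for df_x, df_y and params, chosen so the example runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for df_x / df_y.
rng = np.random.RandomState(0)
df_x = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=120),
    "size": rng.choice(["S", "M", "L"], size=120),
})
y = ((df_x["color"] == "red") | (rng.rand(120) < 0.2)).astype(int).to_numpy()

# Deliberately tiny grid (an assumption) so the sketch finishes fast.
small_params = {"n_estimators": [5, 10], "max_depth": [2, 4]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=23), small_params, cv=3)
pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"), grid_search)
cv = StratifiedKFold(n_splits=5)

# return_estimator=True keeps the pipeline fitted on each outer training split.
results = cross_validate(
    pipe, df_x, y,
    scoring=["roc_auc", "accuracy"],
    cv=cv,
    return_estimator=True,
)

for fold, fitted_pipe in enumerate(results["estimator"]):
    # make_pipeline names each step after its lowercased class name.
    gs = fitted_pipe.named_steps["gridsearchcv"]
    print(fold, gs.best_params_, gs.best_score_,
          results["test_roc_auc"][fold], results["test_accuracy"][fold])
```

Both metrics come from a single pass here, so the inner grid search runs once per outer split instead of once per `cross_val_score` call.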
I used the code below, but got the following error: "None of [Int64Index([78, 79, 80, 82, 83, 84, 85, 86, 87, 88,\n ...\n 422, 423, 424, 425, 426, 427, 428, 429, 430, 431],\n dtype='int64', length=344)] are in the [columns]"
rf = RandomForestClassifier(random_state = 23)
ohe = OneHotEncoder(sparse=True, handle_unknown = 'ignore')
outer_kf = StratifiedKFold(n_splits=5,shuffle=False,random_state=1)
inner_kf = StratifiedKFold(n_splits=3,shuffle=False,random_state=2)
params = {
    'max_depth': np.linspace(1, 32, 8, endpoint=True),
    'n_estimators': [1, 2, 4, 8, 16, 32, 64, 200],
    'min_samples_split': np.linspace(0.1, 1.0, 5, endpoint=True),
    'min_samples_leaf': np.linspace(0.1, 0.5, 5, endpoint=True)
}
outer_loop_accuracy_scores = []
outer_loop_roc_auc_scores = []
inner_loop_won_params = []
inner_loop_accuracy_scores = []
for train_index,test_index in outer_kf.split(df_x,df_y.values.ravel()):
    GSCV = GridSearchCV(estimator=rf,param_grid=params,cv=inner_kf, verbose = 0)
    # Fit the encoder on the outer training split only, then reuse it on the outer test split
    ohe_x = ohe.fit_transform(df_x[train_index]).toarray()
    ohe_x_test = ohe.transform(df_x[test_index]).toarray()
    # GSCV is looping through the training data to find the best parameters. This is the inner loop
    GSCV.fit(ohe_x,df_y[train_index])
    # The best hyper parameters from GSCV is now being tested on the unseen outer loop test data.
    pred = GSCV.predict(ohe_x_test)
    # Appending the "winning" hyper parameters and their associated accuracy score
    inner_loop_won_params.append(GSCV.best_params_)
    outer_loop_accuracy_scores.append(accuracy_score(df_y[test_index],pred))
    outer_loop_roc_auc_scores.append(roc_auc_score(df_y[test_index],pred))
    inner_loop_accuracy_scores.append(GSCV.best_score_)
for i in zip(inner_loop_won_params,outer_loop_roc_auc_scores,outer_loop_accuracy_scores,inner_loop_accuracy_scores):
    print(i)
print('Mean of outer loop accuracy score:',np.mean(outer_loop_accuracy_scores))
print('Mean of outer loop roc_auc score:',np.mean(outer_loop_roc_auc_scores))
print("---%0.1f minutes---" %((timeit.default_timer()-start_time)/60))
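How can I integrate my structure with this code? Can I use a pipeline inside the loop, or should I drop the pipeline? And how should I use the OneHotEncoder? I get a column error because different feature values show up.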
*** It seems the problem occurred because df_x is a DataFrame in my case. I switched to df_x.values and df_y.values.ravel(), and now it works.
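The difference can be reproduced on a tiny example. Indexing a DataFrame with square brackets and an integer array looks the integers up as column labels, which is what produces the "are in the [columns]" KeyError, whereas `.iloc` or the underlying NumPy array selects rows by position. The frame and index array below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df_x and one train_index array from the splitter.
df = pd.DataFrame({"a": ["x", "y", "z", "x"], "b": ["p", "q", "p", "q"]})
train_index = np.array([0, 2, 3])

# df[...] with an integer array is a column lookup, so it fails here.
try:
    df[train_index]
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)

# Positional row selection that keeps the DataFrame (and its column names).
rows_iloc = df.iloc[train_index]

# Positional row selection on the plain NumPy array, as in the workaround above.
rows_values = df.values[train_index]

print(rows_iloc)
print(rows_values)
```

`.iloc` has the advantage of keeping the data as a DataFrame, which matters if a later step selects columns by name.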
I have now pinned down the error and the code runs, but I still have the same questions about the logic and the pipeline.
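For the pipeline part, one possible arrangement is to keep the explicit outer loop but pass a pipeline (encoder plus classifier) to GridSearchCV as the estimator, so the encoder is refit on every inner training split. This is a sketch under assumptions rather than a definitive answer: the toy data, the step names "ohe" and "rf", and the reduced grid are all hypothetical. It also scores roc_auc from `predict_proba`, which is what `scoring='roc_auc'` uses, rather than from hard predictions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for df_x / df_y.
rng = np.random.RandomState(0)
df_x = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=120),
    "size": rng.choice(["S", "M", "L"], size=120),
})
df_y = pd.Series(((df_x["color"] == "red") | (rng.rand(120) < 0.2)).astype(int))

# Step names "ohe" and "rf" are arbitrary; grid keys use the "<step>__<param>" form.
pipe = Pipeline([
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ("rf", RandomForestClassifier(random_state=23)),
])
params = {"rf__n_estimators": [5, 10], "rf__max_depth": [2, 4]}

outer_kf = StratifiedKFold(n_splits=5)
inner_kf = StratifiedKFold(n_splits=3)

outer_loop_accuracy_scores = []
outer_loop_roc_auc_scores = []
inner_loop_won_params = []
inner_loop_accuracy_scores = []

for train_index, test_index in outer_kf.split(df_x, df_y):
    # .iloc selects rows by position and keeps the DataFrame structure.
    x_train, x_test = df_x.iloc[train_index], df_x.iloc[test_index]
    y_train, y_test = df_y.iloc[train_index], df_y.iloc[test_index]

    # Inner loop: the whole pipeline is tuned on the outer training split.
    gscv = GridSearchCV(estimator=pipe, param_grid=params, cv=inner_kf)
    gscv.fit(x_train, y_train)

    # Outer loop: evaluate the refit best pipeline on the unseen outer test split.
    pred = gscv.predict(x_test)
    proba = gscv.predict_proba(x_test)[:, 1]

    inner_loop_won_params.append(gscv.best_params_)
    inner_loop_accuracy_scores.append(gscv.best_score_)
    outer_loop_accuracy_scores.append(accuracy_score(y_test, pred))
    outer_loop_roc_auc_scores.append(roc_auc_score(y_test, proba))

for row in zip(inner_loop_won_params, outer_loop_roc_auc_scores,
               outer_loop_accuracy_scores, inner_loop_accuracy_scores):
    print(row)
print("Mean of outer loop accuracy score:", np.mean(outer_loop_accuracy_scores))
print("Mean of outer loop roc_auc score:", np.mean(outer_loop_roc_auc_scores))
```

With `handle_unknown="ignore"`, categories that appear only in a test split are encoded as all zeros instead of raising an error, so the pipeline does not need to be dropped to deal with differing feature values.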