Hyperparameter tuning on the whole dataset?

Date: 2018-04-11 14:23:05

Tags: python machine-learning hyperparameters

This might be a strange question, because I don't fully understand hyperparameter tuning yet.

Currently, I'm using sklearn's GridSearchCV to tune the parameters of a RandomForestClassifier, like this:

gs = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=42),
                  param_grid={'max_depth': range(5, 25, 4),
                              'min_samples_leaf': range(5, 40, 5),
                              'criterion': ['entropy', 'gini']},
                  scoring=scoring, cv=3, refit='Accuracy', n_jobs=-1)
gs.fit(X_Distances, Y)
results = gs.cv_results_
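For reference, the `scoring` object is not shown in the question; since `refit='Accuracy'` is passed, it is presumably a multi-metric dict containing a scorer registered under the key `'Accuracy'`, which tells GridSearchCV which metric selects `best_params_`. A minimal sketch of what it might look like (the second metric is purely illustrative):

```python
from sklearn.metrics import make_scorer, accuracy_score, f1_score

# Hypothetical multi-metric scoring dict. With multi-metric scoring,
# refit='Accuracy' requires a key named exactly 'Accuracy' so GridSearchCV
# knows which metric to use for best_params_ / best_score_.
scoring = {
    'Accuracy': make_scorer(accuracy_score),
    'F1_macro': make_scorer(f1_score, average='macro'),
}
```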

Afterwards, I inspect the gs object's best_params and best_score. Then I use best_params to instantiate a RandomForestClassifier and run stratified cross-validation again to record the metrics and print the confusion matrices:

rf = RandomForestClassifier(n_estimators=1000, min_samples_leaf=7, max_depth=18,
                            criterion='entropy', random_state=42)
accuracy = []
metrics = {'accuracy':[], 'precision':[], 'recall':[], 'fscore':[], 'support':[]}
counter = 0

print('################################################### RandomForest ###################################################')
for train_index, test_index in skf.split(X_Distances,Y):
    X_train, X_test = X_Distances[train_index], X_Distances[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)

    precision, recall, fscore, support = np.round(score(y_test, y_pred), 2)
    metrics['accuracy'].append(round(accuracy_score(y_test, y_pred), 2))
    metrics['precision'].append(precision)
    metrics['recall'].append(recall)
    metrics['fscore'].append(fscore)
    metrics['support'].append(support)

    print(classification_report(y_test, y_pred))
    matrix = confusion_matrix(y_test, y_pred)
    methods.saveConfusionMatrix(matrix, ('confusion_matrix_randomforest_distances_' + str(counter) +'.png'))
    counter = counter+1

meanAcc = round(np.mean(np.asarray(metrics['accuracy'])), 2) * 100
print('meanAcc: ', meanAcc)
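As an aside, the manual loop above can be condensed with sklearn's `cross_validate`, which runs the same folds and collects per-fold scores for several metrics at once. A sketch on toy data standing in for the question's `X_Distances` / `Y`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Toy data in place of X_Distances / Y from the question.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=42)

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=7, max_depth=18,
                            criterion='entropy', random_state=42)
skf = StratifiedKFold(n_splits=3)

# One call replaces the manual fit/predict/score loop; results come back
# as arrays with one entry per fold under keys 'test_<metric>'.
cv_results = cross_validate(rf, X, y, cv=skf,
                            scoring=['accuracy', 'precision_macro', 'recall_macro'])
mean_acc = cv_results['test_accuracy'].mean()
```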

Is this a reasonable approach, or am I completely off?

Edit:

I just tested the following:

gs = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=42),
                  param_grid={'max_depth': range(5, 25, 4),
                              'min_samples_leaf': range(5, 40, 5),
                              'criterion': ['entropy', 'gini']},
                  scoring=scoring, cv=3, refit='Accuracy', n_jobs=-1)
gs.fit(X_Distances, Y)

This yields best_score = 0.5362903225806451 at best_index = 28. When I check the accuracies of the 3 folds at index 28, I get:

  1. split0: 0.5185929648241207
  2. split1: 0.526686807653575
  3. split2: 0.5637651821862348

This gives a mean test accuracy of 0.5362903225806451, with best_params: {'criterion': 'entropy', 'max_depth': 21, 'min_samples_leaf': 5}.

Now I run this code, which uses the best_params from above and a stratified 3-fold split (as in GridSearchCV):

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, max_depth=21,
                            criterion='entropy', random_state=42)
accuracy = []
metrics = {'accuracy':[], 'precision':[], 'recall':[], 'fscore':[], 'support':[]}
counter = 0

print('################################################### RandomForest_Gini ###################################################')
for train_index, test_index in skf.split(X_Distances, Y):
    X_train, X_test = X_Distances[train_index], X_Distances[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)

    precision, recall, fscore, support = np.round(score(y_test, y_pred))
    metrics['accuracy'].append(accuracy_score(y_test, y_pred))
    metrics['precision'].append(precision)
    metrics['recall'].append(recall)
    metrics['fscore'].append(fscore)
    metrics['support'].append(support)

    print(classification_report(y_test, y_pred))
    matrix = confusion_matrix(y_test, y_pred)
    methods.saveConfusionMatrix(matrix, ('confusion_matrix_randomforest_distances_' + str(counter) +'.png'))
    counter = counter + 1

meanAcc = np.mean(np.asarray(metrics['accuracy']))
print('meanAcc: ', meanAcc)
    

The metrics dictionary yields exactly the same accuracies (split0: 0.5185929648241207, split1: 0.526686807653575, split2: 0.5637651821862348).

However, the computed mean is slightly off: 0.5363483182213101.
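A plausible explanation for the small discrepancy: the unweighted mean of the three split accuracies is exactly the 0.5363483182213101 computed here, whereas GridSearchCV in sklearn versions of that era (before 0.22) defaulted to `iid=True`, which averages the folds weighted by their test-set sizes, which would account for the slightly different reported best_score. A quick check of the unweighted mean:

```python
import numpy as np

# Per-fold accuracies reported by GridSearchCV at best_index = 28.
split_scores = np.array([0.5185929648241207, 0.526686807653575, 0.5637651821862348])

# Plain (unweighted) mean — matches the value from the manual loop.
unweighted_mean = split_scores.mean()

# With iid=True (the old GridSearchCV default), mean_test_score is instead
# a weighted average, each fold weighted by its number of test samples,
# which can differ slightly when stratified folds are not equally sized.
```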

1 Answer:

Answer 0 (score: 3)

While this seems like a promising approach, you are running a risk: you are tuning on, and then evaluating the results of this tuning with, the same dataset.

While in some settings this is a legitimate approach, I would carefully compare the metric you obtain at the end against the reported best_score. If these are far apart, you should tune the model on the training set only (right now, you are tuning with everything). In practice, that means performing the split beforehand and making sure GridSearchCV never sees the test set.

This can be done like so:

train_x, val_x, train_y, val_y = train_test_split(X_distances, Y, test_size=0.3, random_state=42)

Then you would run the tuning and the training on train_x, train_y.
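Putting that suggestion together, a sketch of the split-then-tune workflow; toy data and a reduced grid stand in for the question's X_Distances, Y, and full param_grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy data in place of X_Distances / Y.
X, Y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=42)

# Hold out a validation set that the grid search never sees.
train_x, val_x, train_y, val_y = train_test_split(X, Y, test_size=0.3,
                                                  stratify=Y, random_state=42)

# Tune on the training portion only (reduced grid for illustration).
gs = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=42),
                  param_grid={'max_depth': [5, 13, 21],
                              'min_samples_leaf': [5, 20, 35]},
                  scoring='accuracy', cv=3, n_jobs=-1)
gs.fit(train_x, train_y)

# Evaluate the refit best estimator on held-out data; compare this
# against gs.best_score_ to see how far apart the two are.
val_acc = accuracy_score(val_y, gs.best_estimator_.predict(val_x))
```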

On the other hand, if the two scores are close, I guess you're fine.