H2O cross-validation does not match single-fold train/test

Time: 2020-07-28 14:03:45

Tags: python machine-learning random-forest h2o

I am trying to understand how cross-validation works in H2O when the folds are specified via the 'fold_column' argument. The documentation says:

The fold_column option specifies the column in the dataset that contains the cross-validation fold index assignment for each observation.

I assumed that in each cross-validation iteration, the rows with fold_column = i are used as the test set and the remaining rows as the training set. However, when I instead use those same splits to train and test a model manually, I get different performance results. In the example below, I create a column with values between 1 and 5 to use as the split index and run H2O cross-validation with it (via the fold_column argument). I then use the same column to train and test a model on one of those splits and compare the results.
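The split scheme I am assuming can be sketched with plain index arithmetic (a minimal sketch over a hypothetical fold array, not H2O itself):

```python
import numpy as np

rng = np.random.default_rng(21)
folds = rng.integers(1, 6, size=20)  # hypothetical fold assignments, values 1..5

for i in range(1, 6):
    # rows whose fold index equals i act as the test set
    test_idx = np.where(folds == i)[0]
    # all remaining rows act as the training set
    train_idx = np.where(folds != i)[0]
    # in every iteration the two index sets partition the data
```

Under this assumption, each row is used exactly once as test data across the five iterations.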

Here is a reproducible example:

import h2o
from h2o.estimators import H2ORandomForestEstimator
import numpy as np
import pandas as pd

h2o.init()

# Import the prostate dataset
prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")

# Set the predictor names and the response column name
response = "CAPSULE"
predictors = prostate.names[3:8]

# Convert the response column to a factor
prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()

# Add column with random value between 1 and 5 to use for cross-validation
np.random.seed(21)
random_folds = np.random.randint(1, 6, len(prostate))
df_folds = pd.DataFrame(random_folds, columns=['folds'])
df_h20 = prostate.cbind(h2o.H2OFrame(df_folds))


##### Train the model using H2o cross-validation #####

# Train model using fold_column argument
drf = H2ORandomForestEstimator(fold_column = 'folds', max_depth=5, ntrees=1, seed=21)
drf.train(x=predictors, y=response, training_frame=df_h20)

# Get the individual models trained on each CV fold
models = drf.cross_validation_models()

# Print test and train AUC performance for each CV fold
for i, model in enumerate(models):
    print('Fold {}, AUC (test) {} AUC (train) {}'.format(i + 1, model.auc(valid=True), model.auc(train=True)))


##### Train the model on a single K-fold without using H2o cross-validation #####

# Select one of the 5 folds and create the test/train sets
test = df_h20[df_h20['folds'] == 1]
train = df_h20[df_h20['folds'] != 1]

# Train the model
drf = H2ORandomForestEstimator(max_depth=5, ntrees=1, seed=21)
drf.train(x=predictors,
          y=response,
          training_frame=train,
          validation_frame=test)


perf_valid = drf.model_performance(test)
perf_train = drf.model_performance(train)
print('AUC (test) {} AUC (train) {}'.format(perf_valid.auc(), perf_train.auc()))

The output is:

Fold 1, AUC (test) 0.8352221702976504 AUC (train) 0.835269468426379

Fold 2, AUC (test) 0.8215820406943912 AUC (train) 0.8203464750008381

Fold 3, AUC (test) 0.833563260744653 AUC (train) 0.8376839384943596

Fold 4, AUC (test) 0.8295902318635076 AUC (train) 0.8287798683714774

Fold 5, AUC (test) 0.825246953403821 AUC (train) 0.8264781593374212

AUC (test) 0.838142980551675 AUC (train) 0.8382107902781438

The result of the model trained and tested on a single fold without H2O cross-validation does not correspond to any of the 5 results from the 5-fold cross-validation, which is not what I expected. I actually expected this last result to match one of the 5 CV folds. As I understand it, H2O cross-validation should internally train its models in the same way I did in the last part of the code.

Does anyone know why this happens?

EDIT: I added the argument ntrees=1 to reduce model complexity and make sure we are dealing with a single decision tree. I also added seed=21 to both models.

1 answer:

Answer 0: (score: 0)

There is no reason to expect that the AUC of any individual fold should correspond to the AUC obtained when training on a single fold split of the whole training data.

The AUC obtained when training on a single fold split of the whole training data should roughly correspond to the average of the per-fold AUCs.
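As a quick sanity check, that average can be computed directly from the per-fold test AUCs printed in the question (a sketch using the reported values, not a rerun of the models):

```python
# Test-set AUCs of the five CV folds, as reported in the question's output
fold_auc = [
    0.8352221702976504,
    0.8215820406943912,
    0.833563260744653,
    0.8295902318635076,
    0.825246953403821,
]

mean_auc = sum(fold_auc) / len(fold_auc)
print(mean_auc)  # ~0.829, in the same ballpark as the single-split test AUC of ~0.838
```

The mean lands near the single-split result without matching any one fold exactly, which is consistent with the answer's point.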