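I built a GLM model with H2O (v3.14). However, when I check the predictions with h2o.predict, I get very different results depending on how many rows of the validation set I pass in.
Calling h2o.predict on the first 10 rows, I get: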
# Predict using the first 10 lines in validation set
h2o.predict(glm.test, df.valid[1:10,])
# Result:
  predict        p0           p1
1       0 0.9999224 7.756014e-05
2       0 0.9962711 3.728930e-03
3       0 0.9997378 2.622195e-04
4       0 0.9999556 4.437544e-05
5       0 0.9998994 1.006037e-04
6       0 0.9999394 6.062479e-05
But if I call h2o.predict on the first 100 rows, I get very different results:
h2o.predict(glm.test, df.valid[1:100,])
# Result:
  predict         p0        p1
1       1 0.06196439 0.9380356
2       1 0.15371122 0.8462888
3       1 0.01654756 0.9834524
4       1 0.12830090 0.8716991
5       1 0.07195659 0.9280434
6       1 0.09725532 0.9027447
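I have posted code below that reproduces the problem. The dataset (very sparse) can be downloaded from https://www.dropbox.com/s/58ul6zrekpmjh20/dt.truth.csv.gz
# Load the package and connect to (or start) a local H2O cluster
library(h2o)
h2o.init()
# Clear any objects left over from earlier runs
h2o.removeAll()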
# Note: The zipped data file can be downloaded from:
# https://www.dropbox.com/s/58ul6zrekpmjh20/dt.truth.csv.gz
df.truth <- h2o.importFile(
  path = "data/dt.truth.csv.gz", sep = ",", header = TRUE)
df.truth$isTarget <- h2o.asfactor(df.truth$isTarget)
# Split into train / test
splits <- h2o.splitFrame(df.truth, c(0.7), seed=1234)
df.train <- h2o.assign(splits[[1]], "df.train.hex")
df.valid <- h2o.assign(splits[[2]], "df.valid.hex")
# Build a GLM model
glm.test <- h2o.glm(
  training_frame = df.train,
  y = "isTarget",
  family = "binomial",
  missing_values_handling = "MeanImputation",
  seed = 1000000)
# Predict using the first 10 lines in validation set
h2o.predict(glm.test, df.valid[1:10,])
# Predict using the first 100 lines in validation set. Got very different result!
h2o.predict(glm.test, df.valid[1:100,])