GLM model: h2o.predict gives very different results depending on the number of rows used from the validation data

Time: 2017-11-21 03:43:33

Tags: h2o

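I built a GLM model with H2O (v3.14). However, when I check the predictions with h2o.predict, I get very different results depending on how many rows of the validation set I pass in.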

Calling h2o.predict on the first 10 rows, I get:

# Predict using the first 10 lines in validation set
h2o.predict(glm.test, df.valid[1:10,])
# Result:
  predict        p0           p1
1       0 0.9999224 7.756014e-05
2       0 0.9962711 3.728930e-03
3       0 0.9997378 2.622195e-04
4       0 0.9999556 4.437544e-05
5       0 0.9998994 1.006037e-04
6       0 0.9999394 6.062479e-05

But if I call h2o.predict on the first 100 rows, I get very different results:

h2o.predict(glm.test, df.valid[1:100,])
# Result:
  predict         p0        p1
1       1 0.06196439 0.9380356
2       1 0.15371122 0.8462888
3       1 0.01654756 0.9834524
4       1 0.12830090 0.8716991
5       1 0.07195659 0.9280434
6       1 0.09725532 0.9027447

The code below reproduces the problem. The dataset (very sparse) can be downloaded from https://www.dropbox.com/s/58ul6zrekpmjh20/dt.truth.csv.gz

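library(h2o)
h2o.init()        # start or connect to a local H2O cluster
h2o.removeAll()   # clear any frames and models left from earlier runs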

# Note: The zipped data file can be downloaded from:
#       https://www.dropbox.com/s/58ul6zrekpmjh20/dt.truth.csv.gz

df.truth <- h2o.importFile(
  path = "data/dt.truth.csv.gz", sep = ",", header = TRUE)

df.truth$isTarget <- h2o.asfactor(df.truth$isTarget)

# Split into train / test
splits <- h2o.splitFrame(df.truth, c(0.7), seed=1234)
df.train <- h2o.assign(splits[[1]], "df.train.hex")   
df.valid <- h2o.assign(splits[[2]], "df.valid.hex")

# Build a GLM model
glm.test <- h2o.glm(         
  training_frame = df.train,        
  y="isTarget",                 
  family = "binomial",
  missing_values_handling = "MeanImputation",
  seed = 1000000) 

# Predict using the first 10 lines in validation set
h2o.predict(glm.test, df.valid[1:10,])

# Predict using the first 100 lines in validation set.  Got very different result!
h2o.predict(glm.test, df.valid[1:100,])
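
To make the discrepancy easier to see, the two calls can be lined up row by row. This is only a minimal sketch: it assumes glm.test and df.valid from the script above already exist in the session, and the names pred.10, pred.100 and comparison are introduced here purely for illustration.

```r
library(h2o)

# Assumes glm.test and df.valid were created by the script above.
# Score the same model on the two slices and pull the results into R.
pred.10  <- as.data.frame(h2o.predict(glm.test, df.valid[1:10, ]))
pred.100 <- as.data.frame(h2o.predict(glm.test, df.valid[1:100, ]))

# Rows 1-10 are identical in both slices, so p1 would be expected to match.
comparison <- data.frame(
  row         = 1:10,
  p1.from.10  = pred.10$p1,
  p1.from.100 = pred.100$p1[1:10]
)
comparison$abs.diff <- abs(comparison$p1.from.10 - comparison$p1.from.100)

print(comparison)
```

If abs.diff is far from zero for every row, the predictions depend on which other rows are in the frame being scored, not only on the rows themselves.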

0 answers:

No answers yet