h2o predict sometimes fails when the response variable is absent from the test set

Date: 2017-07-04 08:55:17

Tags: r h2o

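When predicting on a test set that does not contain the response variable, h2o fails in several different ways if one-hot encoding was used for factor variables during training, whether the encoding was applied implicitly (as when training a GLM) or specified explicitly (as with other methods).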

The error occurs with R 3.4.0 and h2o 3.12.0.1. We also tested h2o 3.10.3.3.

library(h2o)
localH2O <- h2o.init()

# Load the prostate data shipped with h2o and add a factor with one level per row
prostatePath <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- read.csv(prostatePath)
prostate.hex$our_factor <- as.factor(paste0("Q", rep(1:380, 1)))

prostate.hex <- as.h2o(prostate.hex)
prostate.hex$weight <- 1

# Split by row and drop the response from the test frame
prostate_train <- prostate.hex[1:300, ]
prostate_test <- prostate.hex[301:380, ]
prostate_test <- prostate_test[, -3]  # delete response variable (AGE) from test data

# Example 1: GLM with an offset column
model <- h2o.glm(y = "AGE", x = c("our_factor"),
                 training_frame = prostate_train, offset_column = "weight")
predict(model, newdata = prostate_train)
predict(model, newdata = prostate_test)

# Example 2: GLM without an offset column
model <- h2o.glm(y = "AGE", x = c("our_factor"),
                 training_frame = prostate_train)
predict(model, newdata = prostate_train)
predict(model, newdata = prostate_test)

# Example 3: GBM with explicit one-hot encoding
model <- h2o.gbm(y = "AGE", x = c("our_factor"),
                 training_frame = prostate_train, categorical_encoding = "OneHotExplicit")
predict(model, newdata = prostate_train)
predict(model, newdata = prostate_test)

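The first GLM example, trained with an offset column, produces all NaN when predicting on the test data. The second GLM example produces this error: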

DistributedException from localhost/127.0.0.1:54321: '0', caused by java.lang.ArrayIndexOutOfBoundsException: 0

DistributedException from localhost/127.0.0.1:54321: '0', caused by java.lang.ArrayIndexOutOfBoundsException: 0
    at water.MRTask.getResult(MRTask.java:478)
    at water.MRTask.getResult(MRTask.java:486)
    at water.MRTask.doAll(MRTask.java:390)
    at water.MRTask.doAll(MRTask.java:396)
    at hex.glm.GLMModel.predictScoreImpl(GLMModel.java:1215)
    at hex.Model.score(Model.java:1077)
    at water.api.ModelMetricsHandler$1.compute2(ModelMetricsHandler.java:351)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
    at hex.DataInfo.extractDenseRow(DataInfo.java:1025)
    at hex.glm.GLMScore.map(GLMScore.java:148)
    at water.MRTask.compute2(MRTask.java:657)
    at water.H2O$H2OCountedCompleter.compute1(H2O.java:1352)
    at hex.glm.GLMScore$Icer.compute1(GLMScore$Icer.java)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1348)
    ... 5 more

Error: DistributedException from localhost/127.0.0.1:54321: '0', caused by java.lang.ArrayIndexOutOfBoundsException: 0

The GBM example produces this error (even though the only column missing from the test data is the response variable):

java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
    at hex.Model.adaptTestForTrain(Model.java:1028)
    at hex.Model.adaptTestForTrain(Model.java:854)
    at hex.Model.score(Model.java:1072)
    at water.api.ModelMetricsHandler$1.compute2(ModelMetricsHandler.java:351)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Error: java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set

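The errors seem to be specific to factor variables combined with one-hot encoding. They can be worked around by adding a "fake" response column to the test data set (we tested this, and the values in that column have no effect on the predictions, as we expected), but this is obviously not ideal.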
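For reference, the "fake" response column workaround can be sketched as below. This is a minimal sketch, not an h2o feature: `add_placeholder_response` is a hypothetical helper name, and the placeholder value 0 is an arbitrary choice. The helper uses only `names()` and `[[<-`, so the demonstration runs on a plain data frame; we assume the same call works on an H2OFrame, in the same way that `prostate.hex$weight<-1` does in the example above.

```r
# Hypothetical helper (not part of h2o): add a constant placeholder response
# column to a test frame when that column is missing.
add_placeholder_response <- function(frame, response, value = 0) {
  if (!(response %in% names(frame))) {
    frame[[response]] <- value  # a scalar is recycled to every row
  }
  frame
}

# Small stand-in for the test data, using a plain data frame.
demo_test <- data.frame(our_factor = factor(c("Q1", "Q2", "Q3")))
demo_fixed <- add_placeholder_response(demo_test, "AGE")
print(names(demo_fixed))

# Intended use with the frames from the post (needs a running h2o cluster):
# prostate_test_fixed <- add_placeholder_response(prostate_test, "AGE")
# predict(model, newdata = prostate_test_fixed)
```

Because the helper leaves the frame untouched when the response is already present, it can be applied to any test frame before calling `predict`.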

With 5 or more factor levels, the errors persist even when every factor level is present in both the training and test sets:

prostate.hex$our_factor <- as.factor(paste0("Q", rep(1:5, 76)))

With 4 or fewer levels, the GLM works without problems, but the GBM still produces its error message.
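The statement that all five levels occur in both splits can be checked in base R without h2o. The sketch below rebuilds the five-level factor and applies the same row split as the example (rows 1 to 300 for training, 301 to 380 for testing); the variable names `train_levels`, `test_levels` and `same_levels` are only illustrative.

```r
# Rebuild the five-level factor from the post (380 rows, levels Q1..Q5).
our_factor <- as.factor(paste0("Q", rep(1:5, 76)))

# Same row split as the post: rows 1-300 for training, 301-380 for testing.
train_levels <- unique(as.character(our_factor[1:300]))
test_levels <- unique(as.character(our_factor[301:380]))

# TRUE when both splits contain exactly the same set of levels.
same_levels <- setequal(train_levels, test_levels)
print(same_levels)

# Number of test rows per level.
print(table(our_factor[301:380]))
```

Since the levels cycle every five rows, each split sees every level, so unseen levels in the test data should not be what triggers the errors.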

0 answers