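When predicting on a test set that does not contain the response variable, h2o fails in several different ways if one-hot encoding was used for factor variables during training. This happens whether the encoding is applied implicitly, as when training a GLM, or specified explicitly for other methods.
This error occurs in R 3.4.0 with h2o 3.12.0.1. We have also tested h2o 3.10.3.3.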
library(h2o)
localH2O = h2o.init()

# Load the prostate data shipped with the h2o package
prostatePath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = read.csv(prostatePath)

# Add a factor column with one distinct level per row (Q1 ... Q380)
prostate.hex$our_factor <- as.factor(paste0("Q", c(rep(c(1:380), 1))))
prostate.hex <- as.h2o(prostate.hex)
prostate.hex$weight <- 1  # constant column, used as the offset below

# Split into training and test frames
prostate_train <- prostate.hex[1:300, ]
prostate_test <- prostate.hex[301:380, ]
prostate_test <- prostate_test[, -3]  # delete response variable (AGE) from test data

# Example 1: GLM with an offset column
model <- h2o.glm(y = "AGE", x = c("our_factor"),
                 training_frame = prostate_train, offset_column = "weight")
predict(model, newdata = prostate_train)
predict(model, newdata = prostate_test)

# Example 2: GLM without an offset column
model <- h2o.glm(y = "AGE", x = c("our_factor"),
                 training_frame = prostate_train)
predict(model, newdata = prostate_train)
predict(model, newdata = prostate_test)

# Example 3: GBM with explicit one-hot encoding
model <- h2o.gbm(y = "AGE", x = c("our_factor"),
                 training_frame = prostate_train, categorical_encoding = "OneHotExplicit")
predict(model, newdata = prostate_train)
predict(model, newdata = prostate_test)
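The first GLM example, trained with an offset column, produces all NaN values when predicting on the test data. The second GLM example produces this error: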
DistributedException from localhost/127.0.0.1:54321: '0', caused by java.lang.ArrayIndexOutOfBoundsException: 0
DistributedException from localhost/127.0.0.1:54321: '0', caused by java.lang.ArrayIndexOutOfBoundsException: 0
at water.MRTask.getResult(MRTask.java:478)
at water.MRTask.getResult(MRTask.java:486)
at water.MRTask.doAll(MRTask.java:390)
at water.MRTask.doAll(MRTask.java:396)
at hex.glm.GLMModel.predictScoreImpl(GLMModel.java:1215)
at hex.Model.score(Model.java:1077)
at water.api.ModelMetricsHandler$1.compute2(ModelMetricsHandler.java:351)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at hex.DataInfo.extractDenseRow(DataInfo.java:1025)
at hex.glm.GLMScore.map(GLMScore.java:148)
at water.MRTask.compute2(MRTask.java:657)
at water.H2O$H2OCountedCompleter.compute1(H2O.java:1352)
at hex.glm.GLMScore$Icer.compute1(GLMScore$Icer.java)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1348)
... 5 more
Error: DistributedException from localhost/127.0.0.1:54321: '0', caused by java.lang.ArrayIndexOutOfBoundsException: 0
The GBM example produces this error (even though the only column missing from the test data is the response variable):
java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
at hex.Model.adaptTestForTrain(Model.java:1028)
at hex.Model.adaptTestForTrain(Model.java:854)
at hex.Model.score(Model.java:1072)
at water.api.ModelMetricsHandler$1.compute2(ModelMetricsHandler.java:351)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Error: java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
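The errors appear to be specific to factor variables combined with one-hot encoding. They can be worked around by adding a "fake" response column to the test data set (we tested this, and the values in that column have no effect on the predictions, as we expected), but this is clearly not ideal.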
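For reference, a minimal sketch of that workaround. It assumes the `model` and `prostate_test` objects from the reproduction script are still available on a running h2o cluster; the name `prostate_test_padded` and the placeholder value 0 are our own choices for illustration, not anything h2o requires.

```r
library(h2o)

# Assumes `model` and `prostate_test` exist as created in the reproduction
# script, with the AGE column already removed from prostate_test.

# Copy the test frame and add a placeholder response column.
# The value 0 is arbitrary; in our tests the values did not affect predictions.
prostate_test_padded <- prostate_test
prostate_test_padded$AGE <- 0

# Scoring the padded frame avoided the failures described above in our tests.
predict(model, newdata = prostate_test_padded)
```

The placeholder column only needs to carry the same name as the response used in training ("AGE" here).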
With 5 or more factor levels, the errors persist even when every factor level is present in both the training and test sets:
prostate.hex$our_factor<-as.factor(paste0("Q",c(rep(c(1:5),76))))
With 4 or fewer levels, the GLM has no problems, but the GBM error message is still