Categorical data in R with h2o

Asked: 2019-03-20 17:15:40

Tags: r logistic-regression h2o categorical-data

I have run a logistic regression model on a mix of categorical and numeric variables. The model tries to predict monthly website visits from the first week's activity; unsurprisingly, first-week visits are the strongest predictor. However, when I run deep learning with various architectures and activation functions, performance is very poor. According to the var_imp function, the model assigns high importance to variables that are very unimportant (judging by my logistic regression model, which performs well, this is wrong), and it seems to rank only the categorical subset highly. The model performs badly even on the training data, which is a real warning sign! So I just want to post my code to check whether I am doing anything that hurts the model. It seems strange that logistic regression gets it right while deep learning gets it so wrong, so I suspect I have done something!

str(data)
 $ VAR1 : Factor w/ 8 levels ,..: 1 5 2 1 7 2 5 1 5 1 ...
 $ VAR2 : Factor w/ 5 levels ,..: 1 4 1 1 4 4 4 1 1 4 ...
 $ VAR3 : Factor w/ 2 levels "F","M": 2 2 2 1 2 2 2 2 2 2 ...
 $ VAR4 : Factor w/ 2 levels : 2 1 2 2 1 1 1 2 2 1 ...
 $ VAR5 : num  1000 20 30 20 30 30 30 50 30 400 ...
 $ VAR6 : Factor w/ 2 levels "N","Y": 1 2 2 1 2 2 2 2 1 2 ...
 $ VAR7 : Factor w/ 2 levels "N","Y": 1 2 2 1 2 2 2 2 1 2 ...
 $ VAR8 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ VAR9 : num  56 52 49 29 28 38 34 79 53 36 ...
 $ VAR10: num  3 2 1 3 2 2 3 4 2 2 ...
 $ VAR11: num  1 1 1 2 2 1 1 1 1 2 ...
 $ VAR12: Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...
 $ VAR13: num  1 0 1 1 1 0 1 0 0 0 ...
 $ VAR14: Factor w/ 2 levels "N","Y": 2 1 1 1 1 1 1 1 1 1 ...
 $ VAR15: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
 $ VAR16: num  1 0 0 1 0 0 0 1 1 0 ...
 $ VAR17: num  19 7 1 4 10 2 4 4 7 12 ...
 $ VAR18: Factor w/ 2 levels "N","Y": 1 2 2 2 2 2 2 1 2 1 ...
 $ VAR19: Factor w/ 2 levels "0","Y": 1 1 2 1 1 1 1 1 1 1 ...
 $ VAR20: Factor w/ 2 levels "N","Y": 1 1 2 1 1 1 1 1 1 1 ...
 $ VAR21: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
 $ VAR22: num  0.579 0 0 0 0.4 ...
 $ VAR23: num  1.89 1 1 1 2.9 ...
 $ VAR24: num  0.02962 0.00691 0.05327 0.02727 0.01043 ...
 $ VAR25: Factor w/ 3 levels ..: 2 2 2 3 3 2 3 2 1 3 ...
 $ VAR26: num  3 2 1 2 3 1 2 1 2 4 ...
 $ VAR27: num  3 2 1 1 5 1 1 1 1 2 ...
 $ VAR_RESPONSE: num  7 24 4 3 8 12 5 48 2 7 ...

# count missing values per column (two equivalent checks)
sapply(data, function(x) sum(is.na(x)))
colSums(is.na(data))

# replace all NAs with 0
data[is.na(data)] = 0
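Note that zero-filling this way is only safe for numeric columns: assigning 0 into a factor column whose levels do not include "0" produces NA again, with an "invalid factor level" warning. A minimal type-aware sketch, assuming `data` is a plain data.frame (the "Missing" level name is an arbitrary choice):

```r
# Sketch: NA handling that respects column types.
data[] <- lapply(data, function(col) {
  if (is.factor(col)) {
    col <- addNA(col)                            # NA becomes an explicit level
    levels(col)[is.na(levels(col))] <- "Missing" # rename it for readability
  } else {
    col[is.na(col)] <- 0                         # numeric columns: zero-fill
  }
  col
})
```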



 library(h2o)   # assumes the h2o package is installed
 h2o.init()

 d.hex = as.h2o(data, destination_frame = "d.hex")

 Data_g.split = h2o.splitFrame(data = d.hex,ratios = 0.75)
 Data_train = Data_g.split[[1]]#75% training data
 Data_test = Data_g.split[[2]]
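h2o.splitFrame is random by default, so the partition changes between runs; passing a seed makes it reproducible (the seed value here is arbitrary):

```r
# Reproducible 75/25 split (seed value is arbitrary)
Data_g.split = h2o.splitFrame(data = d.hex, ratios = 0.75, seed = 1234)
```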

 activation_opt <- c("Rectifier","RectifierWithDropout","Maxout","MaxoutWithDropout",
                     "Tanh","TanhWithDropout")
 hidden_opt <- list(c(10,10),c(20,15),c(50,50,50),c(5,3,2),c(100,100),c(5),c(30,30,30),c(50,50,50,50),c(5,4,3,2))
 l1_opt <- c(0,1e-3,1e-5,1e-7,1e-9)
 l2_opt <- c(0,1e-3,1e-5,1e-7,1e-9)

 hyper_params <- list( activation=activation_opt,
                  hidden=hidden_opt,
                  l1=l1_opt,
                  l2=l2_opt )

 search_criteria <- list(strategy = "RandomDiscrete", max_models=30)

 dl_grid10 <- h2o.grid("deeplearning"
                ,grid_id = "deep_learn10"
                ,hyper_params = hyper_params
                ,search_criteria = search_criteria
                ,x = 1:27
                ,y = "VAR_RESPONSE"
                ,training_frame = Data_train)
d_grid10 <- h2o.getGrid("deep_learn10",sort_by = "mse")
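To see which hyperparameters actually won, the sorted grid's model IDs can be used to retrieve and evaluate the best model; a sketch using h2o's grid accessors, with `Data_test` as the holdout split from above:

```r
# Sketch: pull the best model (lowest MSE) from the sorted grid
best_id    <- d_grid10@model_ids[[1]]
best_model <- h2o.getModel(best_id)
h2o.performance(best_model, newdata = Data_test)  # holdout performance
```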

mn = h2o.deeplearning(x = 1:27,
                 y = "VAR_RESPONSE",
                 training_frame = Data_train,
                 model_id = "mn",
                 activation = "Maxout",
                 l1 = 0,
                 l2 = 1e-9,
                 hidden = c(100,100))
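Since the complaint is about implausible importances and a poor fit even on the training data, a quick sanity check is to compare train vs. holdout metrics and inspect the importance table directly (a sketch using standard h2o accessors):

```r
# Sketch: inspect importances and compare train/holdout fit
h2o.varimp(mn)                            # variable importance table
h2o.performance(mn, train = TRUE)         # metrics on training data
h2o.performance(mn, newdata = Data_test)  # metrics on holdout split
```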

0 Answers:

No answers