Error: model is too large in H2O autoencoder training

Date: 2019-05-08 08:41:04

Tags: h2o, dimension-reduction

I have a table of size 5360 × 51200, where 5360 is the number of instances and 51200 is the number of features. I need to reduce the dimensionality of the features. I tried a stacked autoencoder in H2O, but it would not let me train, raising an error along the lines of:

Model is a large and large number of parameters

The code is as follows:

library(h2o)
h2o.init(nthreads = -1)

check.deeplearning_stacked_autoencoder <- function() {
  # this function builds a vector of autoencoder models, one per layer
  #library(h2o)
  #h2o.init()
  get_stacked_ae_array <- function(training_data, layers, args) {
    vector <- c()
    index = 0
    for (i in 1:length(layers)) {
      index = index + 1
      ae_model <- do.call(h2o.deeplearning,
                          modifyList(
                            list(
                              x = names(training_data),
                              training_frame = training_data,
                              autoencoder = T,

                              hidden = layers[i]
                            ),
                            args
                          ))
      # extract the codes from this AE's single hidden layer
      training_data = h2o.deepfeatures(ae_model, training_data, layer = 1)

      names(training_data) <-
        gsub("DF", paste0("L", index), names(training_data))
      vector <- c(vector, ae_model)
    }
    cat(length(vector), "autoencoder models trained\n")
    vector  # return the list of AE models so they can be applied later
  }

  # this function returns final encoded contents
  apply_stacked_ae_array <- function(data, ae) {
    index = 0
    for (i in 1:length(ae)) {
      index = index + 1
      # codes from this AE's single hidden layer
      data = h2o.deepfeatures(ae[[i]], data, layer = 1)
      names(data) <-
        gsub("DF", paste0("L", index), names(data))
    }
    data
  }

  TRAIN <-
    "E:/Chiranjibi file/Geometric features/Lu/Train/d_features.csv"
  TEST <-
    "E:/Chiranjibi file/Geometric features/Lu/Test/d_features.csv"
  response <- 51201

  # set to T for RUnit
  # set to F for stand-alone demo
  if (T) {
    train_hex <- h2o.importFile((TRAIN))
    test_hex  <- h2o.importFile((TEST))
  } else {
    library(h2o)
    h2o.init()
    homedir <-
      paste0(path.expand("~"), "/h2o-dev/") #modify if needed
    train_hex <-
      h2o.importFile(path = paste0(homedir, TRAIN),
                     header = F,
                     sep = ',')
    test_hex  <-
      h2o.importFile(path = paste0(homedir, TEST),
                     header = F,
                     sep = ',')
  }
  train <- train_hex[, -response]
  test  <- test_hex [, -response]
  train_hex[, response] <- as.factor(train_hex[, response])
  test_hex [, response] <- as.factor(test_hex [, response])

  ## Build reference model on full dataset and evaluate it on the test set
  model_ref <-
    h2o.deeplearning(
      training_frame = train_hex,
      x = 1:(ncol(train_hex) - 1),
      y = response,
      hidden = c(67),
      epochs = 50
    )
  p_ref <- h2o.performance(model_ref, test_hex)
  h2o.logloss(p_ref)

  ## Now build a stacked autoencoder: one AE model per entry in `layers`,
  ## compressing the 51,200 input features step by step down to 500
  layers <- c(50000,20000,10000,5000,2000, 1000, 500)
  args <- list(activation = "Tanh",
               epochs = 1,
               l1 = 1e-5)
  ae <- get_stacked_ae_array(train, layers, args)

  ## Now compress the training/testing data with this stacked set of AE models
  train_compressed <- apply_stacked_ae_array(train, ae)
  test_compressed <- apply_stacked_ae_array(test, ae)

  ## Build a simple model using these new features (compressed training data) and evaluate it on the compressed test set.
  train_w_resp <- h2o.cbind(train_compressed, train_hex[, response])
  test_w_resp <- h2o.cbind(test_compressed, test_hex[, response])
  model_on_compressed_data <-
    h2o.deeplearning(
      training_frame = train_w_resp,
      x = 1:(ncol(train_w_resp) - 1),
      y = ncol(train_w_resp),
      hidden = c(67),
      epochs = 1
    )
  p <- h2o.performance(model_on_compressed_data, test_w_resp)
  h2o.logloss(p)


}
#h2o.describe(train)

#doTest("Deep Learning Stacked Autoencoder", check.deeplearning_stacked_autoencoder)

2 Answers:

Answer 0 (score: 1)

Since your dataset has 51,200 features, and the first value in your layers array is 50,000, the first set of network connections alone contains 51,200 × 50,000 == 2.56e9 weights.
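
As a quick sanity check, the encoder weight count at each stage of the stacked setup can be computed directly in R (biases and decoder weights ignored; this snippet is illustrative, not part of the answer):

sizes <- c(51200, 50000, 20000, 10000, 5000, 2000, 1000, 500)
head(sizes, -1) * tail(sizes, -1)  # weights per stage: 2.56e9, 1e9, 2e8, 5e7, 1e7, 2e6, 5e5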

That is far too many. Try much smaller numbers.

Answer 1 (score: 1)

As Tom says, the first layer of your autoencoder is too big.

51,200 is a lot of features. How much correlation is there between them? The more correlation you have, the smaller the first layer of your autoencoder can be.

Try h2o.prcomp() and see how many dimensions cover 99% of the variance. That is usually a good guide to how big your first layer can (and should) be.
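
A minimal sketch of that check, assuming train is the 51,200-column H2OFrame of predictors from the question; the choices of k, transform, and impute_missing here are illustrative assumptions, not from the answer:

pca <- h2o.prcomp(training_frame = train,
                  k = 200,                  # extract a couple hundred components to inspect
                  transform = "STANDARDIZE",
                  impute_missing = TRUE)
summary(pca)  # prints the importance-of-components table; look for where
              # the "Cumulative Proportion" row first reaches 0.99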

Alternatively, if you prefer a more experimental approach (sketched in code after this list):

  • Start with, say, 200 neurons in one layer.
  • Train for enough epochs that it stops improving, then look at the MSE it reaches.
  • Double the number of neurons in that layer.
  • See whether the MSE improves. If not, stop there.
  • If it does, double again and repeat.
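
A rough sketch of that loop, again assuming train is the H2OFrame of predictors from the question; the epoch count and the stopping rule are illustrative choices, not prescribed by the answer:

best_mse <- Inf
n <- 200                                    # starting layer width
repeat {
  ae <- h2o.deeplearning(x = names(train),
                         training_frame = train,
                         autoencoder = TRUE,
                         hidden = c(n),
                         activation = "Tanh",
                         epochs = 50)       # "long enough that it stops improving"
  mse <- h2o.mse(ae)                        # reconstruction MSE on the training data
  cat("neurons:", n, " MSE:", mse, "\n")
  if (mse >= best_mse) break                # no improvement: stop here
  best_mse <- mse
  n <- n * 2                                # double the layer width and repeat
}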

You could then try moving on to multiple layers. But there is not much point using a first layer any bigger than the best result you got from the single-layer experiments.