How to train a model with a large number of categorical features: RStudio crashes

Time: 2019-07-25 18:36:15

Tags: r amazon-ec2 memory-management xgboost

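I have a dataset with more than 800K rows and 66 columns/features. I am using xgboost through caret to train a model with 5-fold cross-validation. However, my R session always crashes because of the two columns shown below, even when I use an Amazon EC2 instance with the following specs: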

Amazon EC2 instance type

m5.4xlarge: 16 vCPUs, 64 GiB memory, EBS-only storage, up to 10 Gbps network, 3,500 Mbps EBS bandwidth

# A tibble: 815,885 x 66
   first_tile             last_tile                  
   <fct>                  <fct>                      
 1 Filly Brown            Body of Evidence           
 2 The Dish               The Hunger Games           
 3 Waiting for Guffman    Hell's Kitchen N.Y.C.      
 4 The Age of Innocence   The Lake House             
 5 Malevolence            In the Name of the Father  
 6 Old Partner            Desperate Measures         
 7 Lady Jane              The Invasion               
 8 Mad Dog Time           Eye of the Needle          
 9 Beauty Is Embarrassing Funny Lady                 
10 The Snowtown Murders   Alvin and the Chipmunks    
11 Superman II            Pina                       
12 Leap of Faith          Capote                     
13 The Royal Tenenbaums   Dead Men Don't Wear Plaid  
14 School for Scoundrels  Tarzan                     
15 Rhinestone             Cocoon: The Return         
16 Burn After Reading     Death Defying Acts         
17 The Doors              Half Baked                 
18 The Wood               Dance of the Dead          
19 Jason X                Around the World in 80 Days
20 Dragon Wars            LOL   


    ## Model Training
    library(caret)
    set.seed(42)
    split <- 0.8
    # Stratified 80/20 split on the target variable
    train_index <- createDataPartition(data_tbl$paid, p = split, list = FALSE)
    data_train  <- data_tbl[train_index, ]
    data_test   <- data_tbl[-train_index, ]


    ## Summarise The Target Variable
    table(data_train$paid) / nrow(data_train)


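    ## Create train/test indexes that preserve the class distribution
    ## Note: createFolds() returns held-out row indices by default, while the
    ## `index` argument of trainControl() expects training rows; use
    ## returnTrain = TRUE if each model should train on 4/5 of the data.
    set.seed(42)
    my_folds <- createFolds(data_train$paid, k = 5)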

    # Compare class distribution
    i <- my_folds$Fold1
    table(data_train$paid[i]) / length(i)

    ## Reusing trainControl
    my_control <- trainControl(
      summaryFunction = twoClassSummary,
      classProbs = TRUE,
      verboseIter = TRUE,
      savePredictions = TRUE,
      index = my_folds
      )

    model_xgb <- train(
      paid ~ .,
      data = data_train,
      metric = "ROC",
      method = "xgbTree",
      trControl = my_control  # was `myControl`, which is never defined
    )
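  • Can you suggest how I can get around this memory problem?

  • Is it possible to apply some form of one-hot encoding to these features?

  • I would appreciate any advice or help.

  • Is there a way to know how large a machine I need?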

Thanks in advance.

1 answer:

Answer 0 (score: 0)

In machine learning, there are several ways to handle this kind of problem.
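One such approach, sketched here as an assumption rather than as something stated in the post, is to one-hot encode the factor columns into a sparse matrix instead of letting the formula interface expand them into a dense one. The example uses `sparse.model.matrix()` from the Matrix package on a small synthetic data frame (`toy`, with made-up titles) and compares the memory footprint of the two encodings. The helper `dense_gib()` and the figure of 10,000 dummy columns are hypothetical and only illustrate how to estimate the size of a dense matrix for 815,885 rows.

```r
library(Matrix)  # recommended package, shipped with standard R installations

set.seed(42)

# Hypothetical stand-in for the real data: two high-cardinality factor
# columns and a binary target.
n <- 2000
titles <- paste0("title_", seq_len(500))
toy <- data.frame(
  first_title = factor(sample(titles, n, replace = TRUE)),
  last_title  = factor(sample(titles, n, replace = TRUE)),
  paid        = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# Dense one-hot encoding: every cell is stored as a double, zeros included.
dense_x <- model.matrix(paid ~ . - 1, data = toy)

# Sparse one-hot encoding: only the non-zero cells are stored.
sparse_x <- sparse.model.matrix(paid ~ . - 1, data = toy)

dense_bytes  <- as.numeric(object.size(dense_x))
sparse_bytes <- as.numeric(object.size(sparse_x))
cat("dense :", round(dense_bytes  / 1024^2, 2), "MiB\n")
cat("sparse:", round(sparse_bytes / 1024^2, 2), "MiB\n")

# Rough size in GiB of a dense double matrix (8 bytes per cell).
# Hypothetical helper for back-of-the-envelope machine sizing.
dense_gib <- function(n_rows, n_cols) {
  n_rows * n_cols * 8 / 1024^3
}

# Assumed example: 815,885 rows and 10,000 dummy columns in total.
est_gib <- dense_gib(815885, 10000)
cat("estimated dense size:", round(est_gib, 1), "GiB\n")
```

Under the assumed 10,000 dummy columns, the dense estimate comes to roughly 61 GiB for a single copy of the matrix, which is already close to the 64 GiB of the instance before cross-validation makes any further copies. xgboost accepts a sparse `dgCMatrix` directly (for example through `xgb.DMatrix()`), so the sparse matrix does not have to be converted back to a dense one.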