Question

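I have a dataset with more than 800K rows and 66 columns/features. I am using xgboost and caret to train a model with 5-fold cross-validation. However, my R session always crashes because of the two columns shown further down (first_tile and last_tile), even on an Amazon instance with the specs below.

Amazon EC2 instance type:

m5.4xlarge: 16 vCPUs, 64 GiB memory, EBS-only storage, up to 10 Gbps network bandwidth, 3,500 Mbps EBS bandwidth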

# A tibble: 815,885 x 66
   first_tile             last_tile                  
   <fct>                  <fct>                      
 1 Filly Brown            Body of Evidence           
 2 The Dish               The Hunger Games           
 3 Waiting for Guffman    Hell's Kitchen N.Y.C.      
 4 The Age of Innocence   The Lake House             
 5 Malevolence            In the Name of the Father  
 6 Old Partner            Desperate Measures         
 7 Lady Jane              The Invasion               
 8 Mad Dog Time           Eye of the Needle          
 9 Beauty Is Embarrassing Funny Lady                 
10 The Snowtown Murders   Alvin and the Chipmunks    
11 Superman II            Pina                       
12 Leap of Faith          Capote                     
13 The Royal Tenenbaums   Dead Men Don't Wear Plaid  
14 School for Scoundrels  Tarzan                     
15 Rhinestone             Cocoon: The Return         
16 Burn After Reading     Death Defying Acts         
17 The Doors              Half Baked                 
18 The Wood               Dance of the Dead          
19 Jason X                Around the World in 80 Days
20 Dragon Wars            LOL   


    ## Model Training
    library(caret)
    set.seed(42)
    split <- 0.8
    train_index <- createDataPartition(data_tbl$paid, p = split, list = FALSE)
    data_train  <- data_tbl[train_index, ]
    data_test   <- data_tbl[-train_index, ]

    ## Summarise the target variable
    table(data_train$paid) / nrow(data_train)

    ## Create train/test indexes, preserving class proportions
    ## Note: createFolds() returns held-out row indexes by default, while
    ## trainControl(index = ) expects the rows used for training;
    ## pass returnTrain = TRUE to createFolds() for conventional 5-fold CV.
    set.seed(42)
    my_folds <- createFolds(data_train$paid, k = 5)

    # Compare class distribution
    i <- my_folds$Fold1
    table(data_train$paid[i]) / length(i)

    ## Reusing trainControl
    my_control <- trainControl(
      summaryFunction = twoClassSummary,
      classProbs = TRUE,
      verboseIter = TRUE,
      savePredictions = TRUE,
      index = my_folds
    )

    model_xgb <- train(
      paid ~ .,
      data = data_train,
      metric = "ROC",
      method = "xgbTree",
      trControl = my_control
    )

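Can you suggest how I can get around this memory problem?
Is there a way to apply some form of one-hot encoding to these features?
I would appreciate any advice or help.
Is there a way to know how large a machine I need?

Thanks in advance.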

Answer 1

In machine learning there are several ways to tackle this kind of problem.

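Do you really need all 66 features? Have you applied any feature selection techniques? Have you tried dropping the features that contribute nothing to your predictions? Some feature selection mechanisms for R are described here: https://dataaspirant.com/2018/01/15/feature-selection-techniques-r/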
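Before dropping anything, it can help to see which columns are expensive. The sketch below is an illustration on made-up data (`toy`, `estimate_dense_gb` and the movie names are hypothetical, not the asker's). It counts the levels of each factor column and estimates the size of a dense dummy-variable matrix. It assumes that the formula interface (`paid ~ .`) expands every factor into roughly one numeric column per level, which is how `model.matrix()` behaves.

```r
# Toy stand-in for data_tbl (hypothetical data)
set.seed(42)
n <- 1000
toy <- data.frame(
  first_title = factor(sample(sprintf("movie_%03d", 1:300), n, replace = TRUE)),
  last_title  = factor(sample(sprintf("movie_%03d", 1:300), n, replace = TRUE)),
  runtime     = rnorm(n),
  paid        = factor(sample(c("yes", "no"), n, replace = TRUE))
)

# Number of distinct levels in each factor column
n_levels <- sapply(Filter(is.factor, toy), nlevels)
print(sort(n_levels, decreasing = TRUE))

# Rough size (in GiB) of a dense design matrix:
# a factor with k levels contributes about k - 1 dummy columns,
# any other column contributes 1, and each cell is an 8-byte double.
estimate_dense_gb <- function(df, target) {
  predictors <- df[setdiff(names(df), target)]
  n_cols <- sum(vapply(
    predictors,
    function(col) if (is.factor(col)) max(nlevels(col) - 1, 1) else 1,
    numeric(1)
  ))
  nrow(df) * n_cols * 8 / 1024^3
}

print(estimate_dense_gb(toy, "paid"))
```

The same arithmetic gives a feel for machine size. With 815,885 rows and, say, 10,000 dummy columns (an assumed figure, since the real number of distinct titles is not given), a single dense copy is 815885 * 10000 * 8 / 1024^3, or roughly 61 GiB, before any copies made during training.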
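Assuming you need most or all of the features and now want to encode these categorical variables, one-hot encoding seems to be a popular choice, but there are other encoding techniques. One of my preferred options is binary encoding. You can also explore other encoding techniques: https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159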
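As a minimal sketch of binary encoding in base R: each factor level is mapped to its integer code, and the code is written out as 0/1 bit columns, so a factor with k levels needs about log2(k) columns instead of k - 1. The helper name `binary_encode` is hypothetical, and missing values are not handled here.

```r
# Binary-encode a factor: level code -> bit columns (least significant bit first)
binary_encode <- function(x, prefix) {
  stopifnot(is.factor(x))
  codes  <- as.integer(x)                              # level codes 1..k
  n_bits <- max(1L, ceiling(log2(nlevels(x) + 1)))     # bits needed to represent k
  bits   <- matrix(0L, nrow = length(codes), ncol = n_bits)
  for (b in seq_len(n_bits)) {
    # Extract bit b of every code with integer division and modulo
    bits[, b] <- as.integer((codes %/% 2^(b - 1)) %% 2)
  }
  colnames(bits) <- paste0(prefix, "_bit", seq_len(n_bits))
  bits
}

# Example: 5 levels fit in 3 bit columns
x   <- factor(c("a", "b", "c", "d", "e"))
enc <- binary_encode(x, "title")
print(enc)

# A 300-level factor needs 9 bit columns rather than 299 dummies
big <- factor(sprintf("movie_%03d", 1:300))
print(ncol(binary_encode(big, "title")))
```

The resulting columns can be bound to the remaining numeric features with `cbind()`. If one-hot encoding is preferred, a sparse matrix (for example from `Matrix::sparse.model.matrix()`, whose `dgCMatrix` output xgboost can consume) avoids storing all the zeros.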
xgboost also has a subsampling mechanism. Have you tried training on a sample of the data? See xgboost's subsampling parameters here: https://xgboost.readthedocs.io/en/latest/parameter.html
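The two kinds of sampling can be sketched as follows. The grid lists the seven tuning parameters that caret's `xgbTree` method expects, with `subsample` and `colsample_bytree` set below 1; the values are illustrative, not tuned. Because xgboost's own subsampling happens per boosting round, the full data matrix presumably still has to fit in memory, so the second part draws a stratified row sample before training. `stratified_sample` and `toy` are hypothetical names.

```r
# Single-row tuning grid for caret's method = "xgbTree" (illustrative values)
xgb_grid <- expand.grid(
  nrounds          = 100,
  max_depth        = 6,
  eta              = 0.1,
  gamma            = 0,
  colsample_bytree = 0.5,  # fraction of columns sampled per tree
  min_child_weight = 1,
  subsample        = 0.5   # fraction of rows sampled per boosting round
)

# Draw the same fraction of rows from every class of the target
stratified_sample <- function(df, target, frac) {
  rows_by_class <- split(seq_len(nrow(df)), df[[target]])
  idx <- unlist(lapply(rows_by_class, function(rows) {
    size <- max(1, round(frac * length(rows)))
    rows[sample.int(length(rows), size)]
  }), use.names = FALSE)
  df[sort(idx), , drop = FALSE]
}

# Toy data with a 20/80 class split (hypothetical)
set.seed(42)
toy <- data.frame(
  x    = rnorm(1000),
  paid = factor(rep(c("yes", "no"), times = c(200, 800)))
)
small <- stratified_sample(toy, "paid", 0.1)
print(table(small$paid))
```

The grid would be passed to `train()` as `tuneGrid = xgb_grid`, and the sampled data frame would replace `data_train`. Starting with a small fraction and increasing it while watching memory use is one way to find what the machine can handle.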

How to train a model with a large number of categorical features: RStudio crashes
