I have a dataset with more than 800K rows and 66 columns/features. I am using xgboost and caret to train a model with 5-fold cross-validation. However, my R session keeps crashing because of the two columns shown below, even when I run it on an Amazon EC2 instance with the following specs:
m5.4xlarge: 16 vCPU, 64 GiB memory, EBS-only storage, up to 10 Gbps network bandwidth, 3,500 Mbps EBS bandwidth
# A tibble: 815,885 x 66
first_tile last_tile
<fct> <fct>
1 Filly Brown Body of Evidence
2 The Dish The Hunger Games
3 Waiting for Guffman Hell's Kitchen N.Y.C.
4 The Age of Innocence The Lake House
5 Malevolence In the Name of the Father
6 Old Partner Desperate Measures
7 Lady Jane The Invasion
8 Mad Dog Time Eye of the Needle
9 Beauty Is Embarrassing Funny Lady
10 The Snowtown Murders Alvin and the Chipmunks
11 Superman II Pina
12 Leap of Faith Capote
13 The Royal Tenenbaums Dead Men Don't Wear Plaid
14 School for Scoundrels Tarzan
15 Rhinestone Cocoon: The Return
16 Burn After Reading Death Defying Acts
17 The Doors Half Baked
18 The Wood Dance of the Dead
19 Jason X Around the World in 80 Days
20 Dragon Wars LOL
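For context, here is a rough sketch (assuming the two title columns above are the main high-cardinality factors) of estimating how large the design matrix becomes once caret's formula interface expands factors into dense dummy columns:

library(dplyr)

n_first <- n_distinct(data_tbl$first_tile)
n_last  <- n_distinct(data_tbl$last_tile)

# Dense numeric matrix: rows x dummy columns x 8 bytes per double
est_gb <- nrow(data_tbl) * (n_first + n_last) * 8 / 1024^3
cat("Approximate dense design matrix size:", round(est_gb, 1), "GB\n")

If each title column has tens of thousands of distinct values, this estimate alone can exceed the 64 GiB of memory on the instance above.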
## Model Training
library(caret)
set.seed(42)
split <- 0.8
train_index <- createDataPartition(data_tbl$paid, p = split, list = FALSE)
data_train <- data_tbl[train_index, ]
data_test <- data_tbl[-train_index, ]
## Summarise The Target Variable
table(data_train$paid) / nrow(data_train)
## Create train/test indexes, preserving class distribution
set.seed(42)
my_folds <- createFolds(data_train$paid, k = 5)
# Compare class distribution
i <- my_folds$Fold1
table(data_train$paid[i]) / length(i)
## Reusing trainControl
my_control <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = TRUE,
savePredictions = TRUE,
index = my_folds
)
model_xgb <- train(
paid ~ .,
data = data_train,
metric = "ROC",
method = "xgbTree",
trControl = my_control
)
Can you suggest how I can get around this memory problem? Is there some form of one-hot encoding I could apply to these features? I would appreciate any suggestions or help. Is there a way to know how large a machine I need?
Thanks in advance
Answer 0 (score: 0)
In the machine learning world there are several ways to tackle a problem like this.
Do you really need all 66 features? Have you applied any feature selection techniques? Have you tried dropping features that add nothing to your predictions? Check out some of the feature selection approaches available in R here: https://dataaspirant.com/2018/01/15/feature-selection-techniques-r/
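For example, here is a minimal sketch of two quick filter-style checks using caret (reusing data_train from your code; the 0.9 correlation cutoff is an arbitrary choice):

library(caret)

# 1. Drop predictors with zero or near-zero variance
nzv <- nearZeroVar(data_train)
data_train_fs <- if (length(nzv) > 0) data_train[, -nzv] else data_train

# 2. Among numeric predictors, drop one of each highly correlated pair
num_cols  <- vapply(data_train_fs, is.numeric, logical(1))
cor_mat   <- cor(data_train_fs[, num_cols], use = "pairwise.complete.obs")
drop_cols <- colnames(cor_mat)[findCorrelation(cor_mat, cutoff = 0.9)]
data_train_fs <- data_train_fs[, !(names(data_train_fs) %in% drop_cols)]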
Assuming you do need most or all of the features and now want to encode those categorical variables, one-hot encoding seems to be the popular choice, but there are other encoding techniques. One of my picks would be binary encoding, but you can also explore other encoding techniques: https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159
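As a minimal sketch of hand-rolled binary encoding (no extra packages; the helper binary_encode is my own, and the column names follow your data):

binary_encode <- function(x, prefix) {
  ids    <- as.integer(as.factor(x))          # each level -> integer id
  n_bits <- ceiling(log2(max(ids) + 1))       # 0/1 columns needed
  bits   <- vapply(seq_len(n_bits),
                   function(b) as.integer(bitwAnd(ids, bitwShiftL(1L, b - 1L)) > 0),
                   integer(length(ids)))
  colnames(bits) <- paste0(prefix, "_bit", seq_len(n_bits))
  as.data.frame(bits)
}

data_train_enc <- cbind(
  data_train[, setdiff(names(data_train), c("first_tile", "last_tile"))],
  binary_encode(data_train$first_tile, "first_tile"),
  binary_encode(data_train$last_tile,  "last_tile")
)

Each title column then becomes roughly log2(number of distinct titles) 0/1 columns instead of one column per title, which is what makes one-hot encoding blow up memory here.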
xgboost also has a subsampling mechanism. Have you tried training on a sample of the data? Check out xgboost's subsampling parameters here: https://xgboost.readthedocs.io/en/latest/parameter.html
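For example, a minimal sketch that (a) fits on a 25% row sample and (b) passes xgboost's subsample / colsample_bytree parameters through caret's xgbTree tuning grid. The sample fraction and grid values are arbitrary placeholders, and a fresh trainControl is used because my_folds was built against the full data_train:

library(caret)

set.seed(42)
sample_idx  <- sample(nrow(data_train), size = floor(0.25 * nrow(data_train)))
data_sample <- data_train[sample_idx, ]

ctrl_sub <- trainControl(
  method = "cv", number = 5,
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE
)

xgb_grid <- expand.grid(
  nrounds          = 200,
  max_depth        = 6,
  eta              = 0.1,
  gamma            = 0,
  colsample_bytree = 0.5,  # half of the columns per tree
  min_child_weight = 1,
  subsample        = 0.5   # half of the rows per tree
)

model_xgb_sub <- train(
  paid ~ .,
  data      = data_sample,
  metric    = "ROC",
  method    = "xgbTree",
  tuneGrid  = xgb_grid,
  trControl = ctrl_sub
)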