Question

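I have a classification problem in which one of the predictors is a categorical variable X with four levels A, B, C, D. It gets converted into three dummy variables A, B, C. I am trying to do feature selection with recursive feature elimination (RFE) from the caret package. How can I tell the rfe function to treat A, B, C, D as a group, so that if, say, A is excluded, B and C are excluded as well?

After fighting with this all day I am still getting nowhere. Feeding rfe through the formula interface does not work either. I think rfe automatically converts any factor into dummy variables.

Here is my example code: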

#rfe settings
lrFuncs$summary<- twoClassSummary
trainctrl <- trainControl(classProbs= TRUE,
                      summaryFunction = twoClassSummary)

ctrl<-rfeControl(functions=lrFuncs,method = "cv", number=3)

#Data pre-process to exclude nzv and highly correlated variables
x<-training[,c(1, 4:25, 27:39)]
x2<-model.matrix(~., data = x)[,-1]
nzv <- nearZeroVar(x2,freqCut = 300/1)
x3 <- x2[, -nzv]
corr_mat <- cor(x3)
too_high <- findCorrelation(corr_mat, cutoff = .9)
x4 <- x3[, -too_high]

#nzv indexes columns of x2, too_high indexes columns of x3
excludes<-c(colnames(x2)[nzv], colnames(x3)[too_high])

#Exclude the variables identified
x_frame<-x[ , -which(names(x) %in% c(excludes))]

#Run rfe
set.seed(408)
#This does not work with the error below
glmProfile<-rfe(x_frame,y,sizes =subsets, rfeControl = ctrl,trControl =trainctrl,metric = "ROC")
Error in { : task 1 failed - "undefined columns selected"
In addition: Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 
3: glm.fit: fitted probabilities numerically 0 or 1 occurred 

#It works if x_frame is converted to a matrix and then back to a data frame, but then rfe may remove individual dummy variables (e.g. remove A but keep B and C)
glmProfile<-rfe(data.frame(model.matrix(~., data = x_frame)[,-1]),y,sizes =subsets, rfeControl = ctrl,trControl =trainctrl,metric = "ROC")

Here x_frame contains categorical variables with multiple levels.

Any help is much appreciated!

Answer 1

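First: yes, you are right, you cannot use categorical features directly in RFE (Max gives a sound explanation of this on Cross Validated). Interestingly, encoding all levels as dummy variables does lead to an error, which goes away once one dummy variable is dropped. So I would also preprocess the data by creating dummy variables from the categorical variable while leaving out one level.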
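A minimal sketch of that dummy-coding step, using base R's model.matrix rather than anything specific to caret. The data frame toy and its columns X and num are made up for illustration. With R's default treatment contrasts the first factor level is the one left out, and relevel changes which level that is.

```r
# Toy data (hypothetical): one four-level factor and one numeric predictor.
toy <- data.frame(
  X   = factor(c("A", "B", "C", "D", "A", "B", "C", "D")),
  num = c(1.2, 0.4, 3.1, 2.2, 0.9, 1.7, 2.8, 0.3)
)

# Default treatment contrasts leave out the first level ("A");
# [, -1] drops the intercept column.
dummies <- model.matrix(~ ., data = toy)[, -1]
colnames(dummies)

# Make "D" the reference level instead, so the dummies are A, B and C
# as in the question.
toy$X <- relevel(toy$X, ref = "D")
dummies_d <- model.matrix(~ ., data = toy)[, -1]
colnames(dummies_d)
```

Which level is left out matters for what RFE can keep or drop, as the next paragraph discusses.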

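But I would not try to force RFE to keep either all or none of the dummy variables. If RFE throws out some of them (but not all), then some levels apparently carry more useful information than others, which is plausible. Imagine that, of the levels A, B, C, only level A holds useful information about your target variable. If A is kept when the dummy variables are created, RFE may drop B and C. If A is the level left out during dummy creation, RFE may keep both B and C.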
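To see which levels of a factor survived, it can help to map each dummy column back to the variable it came from. The sketch below does this with the "assign" attribute that model.matrix attaches to its result. The data frame toy and the vector kept are hypothetical. In a real run, kept would come from the fitted rfe object, for example via caret's predictors().

```r
# Toy data (hypothetical): one four-level factor and one numeric predictor.
toy <- data.frame(
  X   = factor(c("A", "B", "C", "D")),
  num = c(1.2, 0.4, 3.1, 2.2)
)

# Keep the intercept here: the "assign" attribute is lost when columns
# are subset, so read it from the full model matrix.
mm <- model.matrix(~ ., data = toy)
term_labels <- attr(terms(~ ., data = toy), "term.labels")

# "assign" holds 0 for the intercept and the term index for every other column.
source_var <- c("(Intercept)", term_labels)[attr(mm, "assign") + 1]
names(source_var) <- colnames(mm)

# Hypothetical set of columns retained by RFE.
kept <- c("XB", "num")

# Original variable behind each retained column.
source_var[kept]

# Dummy columns of X that were eliminated.
dropped_from_x <- setdiff(names(source_var)[source_var == "X"], kept)
dropped_from_x
```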

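PS: When mixing continuous and categorical information, consider scaling your data accordingly before handing it to RFE, so that continuous and categorical predictors have a roughly similar influence on RFE.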
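The answer does not say how to scale, so the following is only one possible reading: center and scale the continuous columns to mean 0 and standard deviation 1, and leave the 0/1 dummy columns as they are. The data frame toy and the column names are made up for illustration.

```r
# Toy data (hypothetical): two dummy columns and one continuous predictor
# on a much larger scale.
toy <- data.frame(
  XB  = c(0, 1, 0, 0, 0, 1),
  XC  = c(0, 0, 1, 0, 0, 0),
  num = c(120, 40, 310, 220, 90, 170)
)

# Columns treated as continuous; chosen by hand in this sketch.
continuous <- c("num")

# Standardize only the continuous columns. scale() returns a matrix,
# so convert each result back to a plain numeric vector.
toy[continuous] <- lapply(toy[continuous], function(v) as.numeric(scale(v)))

summary(toy$num)
```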

Caret RFE: treating dummy variables as levels of the same categorical variable
