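I want to build a random forest classifier on a dataset with imbalanced classes.
I now want to compute the PPV of a random forest classifier for every combination of three features, in order to find the "best" marker combination, defined here as the one with the highest PPV and BACC. This is the code I am using.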
# 1) My test dataset
set.seed(5)
data <- data.frame(A=rnorm(100,10,5),B=rnorm(100,15,2),C=rnorm(100,20,5),
                   D=rnorm(100,3,1.5),E=rnorm(100,12,10),G=rnorm(100,12,10),
                   Class=c(rep("A",90), rep("B", 10)))
data[,"Class"] <- as.factor(data[,"Class"])
My expectation is that the data should not separate well, because the feature values of both groups ("A" and "B") are drawn from the same distributions.
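A quick way to check this expectation is to compare the per-class feature means. The following is only a sketch: it rebuilds the same test data so that it runs on its own, and `class_means` is a name introduced here for illustration.

```r
# Rebuild the test data from step 1 so the snippet is self-contained
set.seed(5)
data <- data.frame(A = rnorm(100, 10, 5), B = rnorm(100, 15, 2), C = rnorm(100, 20, 5),
                   D = rnorm(100, 3, 1.5), E = rnorm(100, 12, 10), G = rnorm(100, 12, 10),
                   Class = factor(c(rep("A", 90), rep("B", 10))))

# Mean of every feature within each class (class_means is an illustrative name)
class_means <- aggregate(. ~ Class, data = data, FUN = mean)
print(class_means)
```

If both classes really come from the same distributions, the two rows should differ only by sampling noise, although with just 10 rows in class "B" that noise can still be sizeable.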
# 2) Create a vector containing all combinations of 3 features (without the class)
allcombis <- combn(colnames(data)[-7], m = 3) # exclude column 7, the class column
dfpar <- apply(allcombis, 2, function(i) paste(i, collapse=" + "))
# 3) The output should be a dataframe containing all feature combinations and the PPV
dffinal <- data.frame(par= as.character(dfpar), TP=0, FP=0, TN=0, FN=0, PPV=0, BACC=0)
# 4) Create training and validation sets
rows <- sample(rownames(data), replace = TRUE, size = length(rownames(data))*0.7)
train <- data[as.numeric(rows),]
validation <- data[-as.numeric(names(table(rows))),]
library(randomForest)
library(caret) # provides confusionMatrix()
for (i in dfpar){
  # Create random forest model
  fit <- randomForest(as.formula(paste("Class", i, sep=" ~ ")),
                      data=train,
                      importance=TRUE,
                      ntree=1000)
  # Apply random forest on validation dataset
  Prediction <- predict(fit, validation)
  # caret expects predictions in rows and true classes in columns
  confmatrix <- table(Prediction, validation[,"Class"])
  # Calculate variables of interest: PPV and BACC (positive class is "A", the first level)
  confmatrix_results <- confusionMatrix(confmatrix)
  idx <- which(dffinal[,"par"]==i)
  dffinal[idx, "TP"] <- confmatrix_results[["table"]][1,1]
  dffinal[idx, "FP"] <- confmatrix_results[["table"]][1,2]
  dffinal[idx, "FN"] <- confmatrix_results[["table"]][2,1]
  dffinal[idx, "TN"] <- confmatrix_results[["table"]][2,2]
  dffinal[idx, "PPV"] <- signif(as.vector(confmatrix_results[["byClass"]]["Pos Pred Value"]), digits = 6)
  dffinal[idx, "BACC"] <- signif(as.vector(confmatrix_results[["byClass"]]["Balanced Accuracy"]), digits = 6)
}
View(dffinal)
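For reference, the two metrics stored in `dffinal` can be reproduced by hand from the four cell counts. This is a minimal sketch with made-up counts (hypothetical numbers, not results from the data above), taking class "A" as the positive class; the values should agree with what `confusionMatrix()` reports for the same table.

```r
# Hypothetical cell counts, with class "A" taken as the positive class
TP <- 26; FP <- 2; FN <- 1; TN <- 1

ppv         <- TP / (TP + FP)                   # positive predictive value
sensitivity <- TP / (TP + FN)                   # true positive rate
specificity <- TN / (TN + FP)                   # true negative rate
bacc        <- (sensitivity + specificity) / 2  # balanced accuracy

print(c(PPV = ppv, BACC = bacc))
```

Because BACC averages sensitivity and specificity, a model that mostly predicts the majority class can have a high PPV and still a low BACC.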
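But the result is that one marker combination has a BACC of 95% and a PPV of 1. After some reading I found this blog post: http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
It suggests oversampling the underrepresented group.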
#######
# Replace step 4) with this: create new training and validation sets
## I have a 9:1 ratio of classes in my dataset. Therefore I oversample the underrepresented group "B"
validatonrowsA <- sample(rownames(split(data, data[,"Class"])$A), replace = FALSE, size = length(rownames(data))*0.2)
validatonrowsB <- sample(rownames(split(data, data[,"Class"])$B), replace = TRUE, size = length(rownames(data))*0.2)
# Training rows of class "A": every "A" row that is not in the validation set
trainrowsA <- setdiff(rownames(split(data, data[,"Class"])$A), validatonrowsA)
trainrowsB <- sample(rownames(split(data, data[,"Class"])$B), replace = TRUE, size = length(rownames(data))*0.8)
train <- data[c(trainrowsA, trainrowsB),]
validation <- data[c(validatonrowsA, validatonrowsB),]
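One check that may be worth running on such a split is whether the same original rows end up in both sets. The sketch below repeats only the class "B" sampling scheme from the block above, using the row names 91 to 100 that class "B" has in the test data. The names `rowsB`, `valB`, `trainB` and `overlap_share` are introduced here for illustration, and this is a diagnostic idea rather than a confirmed explanation.

```r
set.seed(5)
rowsB <- as.character(91:100)  # row names of the 10 class "B" rows in the test data

# Same scheme as above: both sets are drawn with replacement from all "B" rows
valB   <- sample(rowsB, replace = TRUE, size = 20)  # 100 * 0.2
trainB <- sample(rowsB, replace = TRUE, size = 80)  # 100 * 0.8

# Share of validation rows of class "B" that also occur in the training set
overlap_share <- mean(valB %in% trainB)
print(overlap_share)
```

A value close to 1 would mean that almost every class "B" row in the validation set has already been seen during training.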
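But now the classification is almost perfect? What is wrong with the code? I have read the documentation of the randomForest package, but I did not find an answer. Thanks in advance for your suggestions!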