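I wrote R code that runs a forward search to select the best features from a dataset and build an SVM model with a high AUC.
The dataset has 4000 columns (features) and 700 rows (instances).
I am very new to R programming and do not know how to improve this code so that it runs faster.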
library("e1071")
library("ROCR")
library("AUC")
library("FSelector")
set.seed(658932)
# Read the data; the first column holds the row names
data <- read.csv("combined_matrix.csv", header = TRUE, row.names = 1)
evaluator <- function(subset) {
  # k rounds of validation, each on a random 70/30 train/test split
  k <- 5
  # Restrict the model to the candidate features passed in by forward.search
  f <- as.simple.formula(subset, "Class")
  results <- sapply(1:k, function(i) {
    idx <- sample(1:nrow(data), nrow(data) * 7 / 10, FALSE)
    test <- data[-idx, ]
    train <- data[idx, ]
    class1.svm.model <- svm(f, data = train, type = "eps-regression",
                            kernel = "linear", na.action = na.omit,
                            probability = TRUE)
    # Prediction and ROC
    class1.svm.pred <- predict(class1.svm.model, test, probability = TRUE)
    pred <- prediction(as.numeric(class1.svm.pred), test$Class)
    perf <- performance(pred, "tpr", "fpr")
    plot(perf, fpr.stop = 0.1)
    abline(a = 0, b = 1)
    # Area under the ROC curve for this split
    auc <- performance(pred, measure = "auc")
    return(auc@y.values[[1]])
  })
  avg <- mean(results)
  print(subset)
  write(subset, file = "outputforwardpc5.txt", append = TRUE)
  print(avg)
  write(avg, file = "outputforwardpc5.txt", append = TRUE)
  write(avg, file = "outputforwardpc5.csv", append = TRUE)
  return(avg)
}
subset <- forward.search(names(data)[-3854],evaluator)
write(subset,file = "outputforwardeclipse.txt",append = TRUE)
f <- as.simple.formula(subset, "Class")
print(f)
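I am using forward.search to find a feature subset and pass it to an SVM model, which is evaluated with 5-fold validation (five random 70/30 splits in the code above) and returns the area under the curve.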
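To make the cost of this procedure concrete, here is a minimal base-R sketch of a greedy forward selection loop of the kind forward.search is documented to perform. The names `greedy_forward` and `toy_eval` are hypothetical and are not part of FSelector, and the stopping rule shown (stop when no candidate improves the score) is an assumption about the general technique rather than a copy of the package's internals.

```r
# Hypothetical helper: greedy forward selection over feature names.
# `evaluate` maps a character vector of feature names to a numeric score.
greedy_forward <- function(features, evaluate) {
  selected <- character(0)
  best_score <- -Inf
  n_evals <- 0
  repeat {
    candidates <- setdiff(features, selected)
    if (length(candidates) == 0) break
    # Score every one-feature extension of the current subset
    scores <- vapply(candidates, function(f) {
      evaluate(c(selected, f))
    }, numeric(1))
    n_evals <- n_evals + length(candidates)
    # Stop as soon as no extension improves on the current best
    if (max(scores) <= best_score) break
    best_score <- max(scores)
    selected <- c(selected, candidates[which.max(scores)])
  }
  list(subset = selected, score = best_score, evaluations = n_evals)
}

# Toy evaluator (an assumption for illustration): rewards "a" and "c",
# penalises subset size.
toy_eval <- function(subset) {
  sum(subset %in% c("a", "c")) - 0.1 * length(subset)
}

res <- greedy_forward(c("a", "b", "c", "d"), toy_eval)
print(res$subset)
print(res$evaluations)
```

Under this scheme each step evaluates every remaining feature once, so with roughly 4000 features the first step alone calls the evaluator about 4000 times, and each call in the code above fits five SVMs.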
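The code runs fine, but it takes a very long time to compute. Can anyone suggest ways to speed up this R code? I am running it on a workstation with 64 GB of RAM and multiple processors.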
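Since several processors are available, one possible direction is to run the validation splits in parallel with the `parallel` package that ships with R. The sketch below is illustrative only: `one_split` and the synthetic `toy` data are hypothetical, `lm()` stands in for `svm()` so the example has no external dependencies, and the AUC is computed with the rank-based (Mann-Whitney) formula instead of ROCR. It assumes a Unix-like system, because `mclapply` relies on forking; on Windows, `parLapply` with a cluster from `makeCluster` is the usual counterpart.

```r
library(parallel)

# One random 70/30 split: fit on the training part, return AUC on the test part.
# lm() is a stand-in for svm() (assumption, to keep the sketch dependency-free).
one_split <- function(i, data, subset) {
  idx <- sample(nrow(data), floor(nrow(data) * 0.7))
  fit <- lm(reformulate(subset, response = "Class"), data = data[idx, ])
  test <- data[-idx, ]
  scores <- predict(fit, test)
  # Rank-based AUC: labels are 0/1, higher score means more likely 1
  r <- rank(scores)
  n1 <- sum(test$Class == 1)
  n0 <- sum(test$Class == 0)
  (sum(r[test$Class == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Synthetic stand-in data (not the real combined_matrix.csv)
RNGkind("L'Ecuyer-CMRG")  # RNG kind suited to parallel streams
set.seed(1)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$Class <- as.numeric(toy$x1 + rnorm(200, sd = 0.5) > 0)

# Run the five splits on two worker processes
aucs <- unlist(mclapply(1:5, one_split, data = toy,
                        subset = c("x1", "x2"), mc.cores = 2))
print(mean(aucs))
```

This only parallelises the splits inside a single evaluator call. The search itself still calls the evaluator one candidate at a time, so spreading the candidate evaluations across cores would likely need a hand-written search loop.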