使用R的随机森林

时间:2016-03-16 04:34:25

标签: r machine-learning random-forest

我正致力于使用R构建乳腺癌数据的预测模型。执行gcrma规范化后,我生成了潜在的预测变量。现在,当我运行RF算法时,我遇到了以下错误

rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)

Error:   Error in randomForest.default(x = pred.data, y = target, importance = TRUE, : Can not handle categorical predictors with more than 53 categories.

代码:

library(randomForest)
library(ROCR)
library(Hmisc)
library(genefilter)

setwd("E:/kavya's project_work/final")
datafile<-"trainset_gcrma.txt"
clindatafile<-read.csv("mod clinical_details.csv")

outfile="trainset_RFoutput.txt"
varimp_pdffile="trainset_varImps.pdf"
MDS_pdffile="trainset_MDS.pdf"
ROC_pdffile="trainset_ROC.pdf"
case_pred_outfile="trainset_CasePredictions.txt"
vote_dist_pdffile="trainset_vote_dist.pdf"

data_import=read.table(datafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_import=clindatafile
clincaldata_order=order(clin_data_import[,"GEO.asscession.number"])
clindata=clin_data_import[clincaldata_order,]
data_order=order(colnames(data_import)[4:length(colnames(data_import))])+3 #Order data without first three columns, then add 3 to get correct index in original file
rawdata=data_import[,c(1:3,data_order)] #grab first three columns, and then remaining columns in order determined above
header=colnames(rawdata)

X=rawdata[,4:length(header)]
ffun=filterfun(pOverA(p = 0.2, A = 100), cv(a = 0.7, b = 10))
filt=genefilter(2^X,ffun)
filt_Data=rawdata[filt,]



#Get potential predictor variables
predictor_data=t(filt_Data[,4:length(header)])
predictor_names=c(as.vector(filt_Data[,3])) #gene symbol
colnames(predictor_data)=predictor_names


target= clindata[,"relapse"]
target[target==0]="NoRelapse"
target[target==1]="Relapse"
target=as.factor(target)

tmp = as.vector(table(target))
num_classes = length(tmp)
min_size = tmp[order(tmp,decreasing=FALSE)[1]]
sampsizes = rep(min_size,num_classes)
rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)


error:"Error in randomForest.default(x = pred.data, y = target, importance = TRUE, : Can not handle categorical predictors with more than 53 categories."

因为我是机器学习的新手,我无法继续。请做必要的。 事先提前。

1 个答案:

答案 0 :(得分:0)

在不知道数据的情况下很难说。对所有预测变量运行scala> @enum(debug = true) class Status { | val Enabled, Disabled = Value | } { sealed abstract class Status(val index: Int, val name: String)(implicit sealant: Status.Sealant); object Status { @scala.annotation.implicitNotFound(msg = "Enum types annotated with ".+("@enum can not be extended directly. To add another value to the enum, ").+("please adjust your `def ... = Value` declaration.")) sealed abstract protected class Sealant; implicit protected object Sealant extends Sealant; case object Enabled extends Status(0, "Enabled") with scala.Product with scala.Serializable; case object Disabled extends Status(1, "Disabled") with scala.Product with scala.Serializable; val values: List[Status] = List(Enabled, Disabled); val fromIndex: _root_.scala.Function1[Int, Status] = Map(Enabled.index.->(Enabled), Disabled.index.->(Disabled)); val fromName: _root_.scala.Function1[String, Status] = Map(Enabled.name.->(Enabled), Disabled.name.->(Disabled)); def switch[A](pf: PartialFunction[Status, A]): _root_.scala.Function1[Status, A] = macro numerato.SwitchMacros.switch_impl[Status, A] }; () } defined class Status defined object Status class,以确保它们不会被意外解释为字符或因子。如果你确实有超过53个级别,则必须将它们转换为二进制变量。示例:

summary