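Dear programmers and statisticians,
I would like to know how to retrieve (or compute) R^2 when using the function cforest() from the package party. The function randomForest() from the package of the same name reports the coefficient of determination, whereas cforest() does not. I read here (https://stats.stackexchange.com/questions/7357/manually-calculated-r2-doesnt-match-up-with-randomforest-r2-for-testing) that the randomForest package computes R^2 with the following formula: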
R2<-1 - sum((y-predicted)^2)/sum((y-mean(y))^2) # y is the actual value
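As a minimal sketch of that formula in base R: the helper name r_squared and the toy vectors below are hypothetical and not part of either package. The sketch also checks that the formula is the same as 1 - MSE / variance, provided the variance uses an n denominator rather than the n - 1 denominator of var().

```r
# Hypothetical helper (not from randomForest or party):
# R^2 from observed values y and predictions
r_squared <- function(y, predicted) {
  ss_res <- sum((y - predicted)^2)   # residual sum of squares
  ss_tot <- sum((y - mean(y))^2)     # total sum of squares around the mean
  1 - ss_res / ss_tot
}

# Toy data, invented purely for illustration
y         <- c(1, 2, 3, 4, 5)
predicted <- c(1.1, 1.9, 3.2, 3.8, 5.0)

r2 <- r_squared(y, predicted)

# Equivalent form: 1 - MSE / (variance with an n denominator).
# var() divides by n - 1, so it is rescaled by (n - 1) / n here.
n      <- length(y)
mse    <- mean((y - predicted)^2)
r2_alt <- 1 - mse / (var(y) * (n - 1) / n)

print(c(r2 = r2, r2_alt = r2_alt))
```

Both quantities agree on the toy data, so any comparison between packages should make sure the same denominator is used on both sides.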
However, when I compare the R^2 values from randomForest() and cforest(), I find a huge difference:
#### Minimal reproducible example ####
### Vectors ###
ARTICLE<-c("Yes", "Yes", "No", "Yes", "No", "No",
"Yes", "No", "No", "No", "No", "No", "Yes", "No", "Yes", "No", "No", "No", "No", "No", "No", "No",
"No", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "No", "No", "No", "No", "No",
"No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",
"No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",
"Yes", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",
"No", "No", "No", "No", "No", "No", "Yes", "No", "No")
COMPSYNT<-c("NP", "NP", "DetPoss", "NP", "NP", "NP", "NP", "NP", "NP",
"NP", "NP", "NP", "PronPers", "NP", "PronForm", "NP", "NP", "NP", "PronForm", "NP", "PronForm", "NP", "PronForm", "PronForm",
"NP", "PronPers", "PronForm", "NP", "DetPoss", "NP", "PronForm", "NFClau", "PronForm", "NP", "NP", "NP", "PronForm", "PronForm", "PronForm",
"NP", "NP", "NP", "PronForm", "PronForm", "NP", "NP", "PronForm", "PronForm", "NP", "PronForm", "PronForm", "NP", "NFClau", "NP",
"PronForm", "NP", "NP", "NP", "NP", "NP", "NP", "PronForm", "PronForm", "NP", "NP", "NP", "PronForm", "NP", "PronForm",
"NP", "PronForm", "NFClau", "PronForm", "NP", "NP", "NFClau", "PronForm", "NP", "NP", "NP", "PronForm", "PronForm", "PronForm", "NP",
"PronForm", "NP", "NP", "PronForm", "PronForm", "PronForm", "NP", "PronForm", "PronPers", "NP", "NP")
POSITION<-c("Fin", "Fin", "Med", "Med", "Fin", "Fin", "Fin", "Fin", "Fin",
"Fin", "Med", "Fin", "Init", "Fin", "Med", "Fin", "Fin", "Fin", "Init", "Fin", "Init", "Init", "Init", "Init",
"Fin", "Fin", "Fin", "Fin", "Init", "Init", "Init", "Fin", "Init", "Init", "Fin", "Fin", "Init", "Init", "Init",
"Fin", "Fin", "Med", "Med", "Init", "Init", "Fin", "Fin", "Init", "Fin", "Fin", "Fin", "Fin", "Med", "Init",
"Init", "Med", "Fin", "Fin", "Init", "Init", "Med", "Init", "Init", "Fin", "Fin", "Init", "Init", "Init", "Init",
"Fin", "Fin", "Med", "Init", "Fin", "Fin", "Med", "Init", "Fin", "Fin", "Fin", "Init", "Init", "Fin", "Init",
"Init", "Fin", "Fin", "Init", "Init", "Init", "Fin", "Init", "Fin", "Fin", "Init")
COMPTYPE<-c("Abstr_1",
"Conc", "Hum", "Abstr_2", "Hum", "Hum", "Conc", "Hum", "Hum", "Hum", "Hum", "Hum", "Hum", "Hum", "Conc", "Hum",
"Hum", "Hum", "Hum", "Abstr_2", "Hum", "Hum", "Hum", "Hum", "Abstr_1", "Conc", "Abstr_1", "Conc", "Conc", "Abstr_1", "Conc",
"Abstr_2", "Hum", "Abstr_1", "Abstr_1", "Conc", "Conc", "Plant", "Hum", "Conc", "Abstr_2", "Conc", "Abstr_1", "Abstr_1", "Abstr_1", "Hum",
"Abstr_1", "Conc", "Hum", "Abstr_1", "Abstr_2", "Abstr_1", "Abstr_2", "Conc", "Hum", "Abstr_1", "Conc", "Abstr_1", "Hum", "Abstr_1", "Abstr_1",
"Hum", "Abstr_2", "Conc", "Abstr_1", "Conc", "Hum", "Conc", "Abstr_1", "Conc", "Abstr_1", "Abstr_2", "Conc", "Conc", "Hum", "Abstr_2",
"Conc", "Abstr_2", "Abstr_2", "Conc", "Abstr_2", "Conc", "Abstr_1", "Abstr_1", "Abstr_1", "Abstr_2", "Hum", "Hum", "Conc", "Abstr_2", "Abstr_1",
"Hum", "Abstr_2", "Conc", "Hum")
SUBSTYPE<-c("Repl",
"Repl", "Repl", "Contr", "Repl", "Repl", "Repl", "Repl", "Repl", "Repl", "Repl", "Repl", "Comp", "Repl", "Repl", "Repl",
"Comp", "Repl", "Repl", "Repl", "Repl", "Repl", "Repl", "Repl", "Repl", "Repl", "Repl", "Repl", "Contr", "Contr", "Contr",
"Contr", "Repl", "Repl", "Repl", "Repl", "Repl", "Repl", "Repl", "Repl", "Repl", "Repl", "Repl", "Repl", "Contr", "Contr",
"Contr", "Repl", "Contr", "Repl", "Repl", "Repl", "Contr", "Contr", "Repl", "Contr", "Repl", "Repl", "Repl", "Contr", "Contr",
"Repl", "Contr", "Repl", "Contr", "Repl", "Repl", "Contr", "Contr", "Contr", "Contr", "Contr", "Contr", "Contr", "Contr", "Contr",
"Contr", "Repl", "Repl", "Comp", "Repl", "Repl", "Repl", "Contr", "Repl", "Contr", "Contr", "Repl", "Repl", "Contr", "Contr",
"Repl", "Contr", "Repl", "Repl")
VARIANT<-c("1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2",
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2",
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2",
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2",
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2",
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2")
PERIOD<-c("1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "3", "3", "3", "3", "3", "4", "4", "4",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "3", "3", "3",
"3", "3", "3", "3", "3", "3", "3", "3", "4", "4", "4", "4", "4", "4", "4",
"4", "4")
PRED<-c(0.9479936898, 0.919449515, 0.9419154421, 0.5983387557,
0.6095731951, 0.6095731951, 0.919449515, 0.6095731951, 0.6095731951,
0.6095731951, 0.7030330529, 0.7525290886, 0.5973901173, 0.7525290886,
0.8111631081, 0.7758242732, 0.655754515, 0.7758242732, 0.3617200806,
0.204189421, 0.3617200806, 0.4091156245, 0.3617200806, 0.3617200806,
0.1909593012, 0.111197398, 0.1317200524, 0.1401576975, 0.3357625661,
0.0354262613, 0.3251898421, 0.0026529555, 0.3617200806, 0.1277255725,
0.1909593012, 0.1401576975, 0.0920054464, 0.0826276571, 0.3617200806,
0.1401576975, 0.204189421, 0.1362205175, 0.1076221699, 0.1021952872,
0.0354262613, 0.2225893662, 0.013977198, 0.0920054464, 0.2225893662,
0.1317200524, 0.1170159378, 0.1909593012, 0.0025081381, 0.0223554982,
0.3617200806, 0.0538830716, 0.1401576975, 0.1909593012, 0.4091156245,
0.0354262613, 0.0538830716, 0.3617200806, 0.0054761797, 0.1401576975,
0.051214367, 0.1171345461, 0.3617200806, 0.0223554982, 0.0141969626,
0.0331976869, 0.4525577246, 0.0023048079, 0.0103973282, 0.0331976869,
0.2786798396, 0.0025693648, 0.0119143655, 0.3508813284, 0.3508813284,
0.1910649906, 0.1038908339, 0.1222175396, 0.260972475, 0.0380847154,
0.1368486957, 0.0294733117, 0.3138516914, 0.4183846938, 0.1219226877,
0.0062738871, 0.0939148073, 0.4183846938, 0.3356194269, 0.3046349387,
0.4823614353)
DEV<-c(0.4479936898,
0.419449515, 0.4419154421, 0.0983387557, 0.1095731951, 0.1095731951,
0.419449515, 0.1095731951, 0.1095731951, 0.1095731951, 0.2030330529,
0.2525290886, 0.0973901173, 0.2525290886, 0.3111631081, 0.2758242732,
0.155754515, 0.2758242732, -0.1382799194, -0.295810579, -0.1382799194,
-0.0908843755, -0.1382799194, -0.1382799194, -0.3090406988,
-0.388802602, -0.3682799476, -0.3598423025, -0.1642374339,
-0.4645737387, -0.1748101579, -0.4973470445, -0.1382799194,
-0.3722744275, -0.3090406988, -0.3598423025, -0.4079945536,
-0.4173723429, -0.1382799194, -0.3598423025, -0.295810579,
-0.3637794825, -0.3923778301, -0.3978047128, -0.4645737387,
-0.2774106338, -0.486022802, -0.4079945536, -0.2774106338,
-0.3682799476, -0.3829840622, -0.3090406988, -0.4974918619,
-0.4776445018, -0.1382799194, -0.4461169284, -0.3598423025,
-0.3090406988, -0.0908843755, -0.4645737387, -0.4461169284,
-0.1382799194, -0.4945238203, -0.3598423025, -0.448785633,
-0.3828654539, -0.1382799194, -0.4776445018, -0.4858030374,
-0.4668023131, -0.0474422754, -0.4976951921, -0.4896026718,
-0.4668023131, -0.2213201604, -0.4974306352, -0.4880856345,
-0.1491186716, -0.1491186716, -0.3089350094, -0.3961091661,
-0.3777824604, -0.239027525, -0.4619152846, -0.3631513043,
-0.4705266883, -0.1861483086, -0.0816153062, -0.3780773123,
-0.4937261129, -0.4060851927, -0.0816153062, -0.1643805731,
-0.1953650613, -0.0176385647)
### Combining the vectors into a data frame ###
# Build the data frame directly: cbind() would coerce every column to character.
# stringsAsFactors=TRUE turns the categorical predictors into factors; DEV and PRED stay numeric.
mydata<-data.frame(ARTICLE, COMPSYNT, COMPTYPE, DEV, PERIOD, POSITION, PRED, SUBSTYPE, VARIANT, stringsAsFactors=TRUE)
### First random forest on my data: 'randomForest' (package: 'randomForest') ###
library(randomForest)
set.seed(123)
mydata.rf1<-randomForest(DEV ~ ARTICLE + COMPSYNT + POSITION + COMPTYPE + SUBSTYPE + PERIOD, data=mydata, ntree=2000, mtry=2, importance=TRUE)
print(mydata.rf1)
Call:
randomForest(formula = DEV ~ ARTICLE + COMPSYNT + POSITION + COMPTYPE + SUBSTYPE + PERIOD, data = mydata, ntree = 2000, mtry = 2, importance = TRUE)
Type of random forest: regression
Number of trees: 2000
No. of variables tried at each split: 2
Mean of squared residuals: 0.0110483
% Var explained: 83.33
## MSE = 0.0110483
## pseudo-R^2 = 0.8333
### Second random forest on my data: 'cforest' (package: 'party') ###
library(party)
set.seed(123)
mydata.rf2<-cforest(DEV ~ ARTICLE + COMPSYNT + POSITION + COMPTYPE + SUBSTYPE + PERIOD, data=mydata, controls=cforest_unbiased(ntree=2000, mtry=2))
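# Out-of-bag predictions from the conditional forest
oob.pred<-predict(mydata.rf2, type="response", OOB=TRUE)
residual<-DEV-oob.pred
mse<-sum(residual^2)/length(DEV)
# Note: var() divides by n - 1, whereas the R^2 formula above divides by n
pseudo.R2<-1-mse/var(DEV)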
## MSE = 0.0380004
## pseudo-R^2 = 0.4327
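One detail in the cforest calculation is that var() uses an n - 1 denominator while the R^2 formula uses n. A rough check, using only the figures reported above and assuming n = 95 (the length of the vectors), suggests that this accounts for well under 0.01 of the gap, so it does not seem to explain the difference by itself. The variable names below are illustrative.

```r
# Figures reported for the cforest run above; n = 95 is assumed from the vector length
n           <- 95
mse_cforest <- 0.0380004
r2_sample   <- 0.4327    # 1 - mse / var(DEV), i.e. with an n - 1 denominator

# Back out var(DEV) from the reported MSE and pseudo-R^2
var_sample <- mse_cforest / (1 - r2_sample)

# Variance with an n denominator, matching sum((y - mean(y))^2) / n
var_pop <- var_sample * (n - 1) / n

# Pseudo-R^2 on the same footing as the formula quoted earlier
r2_pop <- 1 - mse_cforest / var_pop

# Size of the change caused by the denominator alone
denominator_gap <- r2_sample - r2_pop
print(denominator_gap)
```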
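I can't seem to figure out why there is such a large difference between my two R^2 values. My questions are as follows:
1) Can we use the above formula for R^2 when we use cforest()? If so, why do I get such different values?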
2) How can R^2 be computed when we use cforest()?
Thank you in advance for your explanations and suggestions.