Question

I am using the gbm() function to build a model, and I want to get its accuracy. Here is my code:

df<-read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)

str(df)

F=c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for(i in F) df[,i]=as.factor(df[,i])

library(caret)

set.seed(1000)
intrain<-createDataPartition(y=df$Creditability, p=0.7, list=FALSE)
train<-df[intrain, ]
test<-df[-intrain, ]

install.packages("gbm")
library("gbm")

df_boosting<-gbm(Creditability~.,distribution = "bernoulli", n.trees=100, verbose=TRUE, interaction.depth=4,
                 shrinkage=0.01, data=train)
summary(df_boosting)

yhat.boost<-predict (df_boosting ,newdata =test, n.trees=100)
mean((yhat.boost-test$Creditability)^2)

However, when I call summary(), an error occurs. The error message is as follows.

Error in plot.window(xlim, ylim, log = log, ...) : 
  need finite 'xlim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf

Also, when I use the mean function to compute the MSE, I get the following warning:

Warning message:
In Ops.factor(yhat.boost, test$Creditability) :
  '-' not meaningful for factors

Do you know why these two errors occur? Thanks in advance.

Answer 1

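The problem in your code is how the (binary) response variable Creditability is defined. Column 1 is Creditability, and because 1 is in your F vector, the loop converts it to a factor. gbm, however, needs a numeric response variable (0/1 for distribution = "bernoulli").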
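To see why a factor response causes trouble, here is a minimal base R sketch. The toy vector y is made up for illustration and only stands in for Creditability; it shows that arithmetic on a factor yields NA (the source of the Ops.factor warning) and how a factor can be turned back into its 0/1 values.

```r
# Hypothetical toy response, standing in for Creditability
y <- c(1, 0, 1, 1, 0)
y_factor <- as.factor(y)

# Arithmetic on a factor is not defined: R warns and returns NA
bad <- suppressWarnings(y_factor - 0.5)

# as.numeric() on a factor returns the level codes (1, 2), not the labels
codes <- as.numeric(y_factor)

# Going through as.character() recovers the original 0/1 values
y_numeric <- as.numeric(as.character(y_factor))

print(bad)
print(codes)
print(y_numeric)
```

If the response has already been converted to a factor, the as.numeric(as.character(...)) step is one way to undo it before fitting the model.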

Here is the corrected code:

df <- read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)

# Column 1 (Creditability) is left out so the response stays numeric
factor_cols <- c(2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for(i in factor_cols) df[,i] <- as.factor(df[,i])
str(df)

Creditability is now a binary numeric variable:

'data.frame':   1000 obs. of  21 variables:
 $ Creditability                    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Account.Balance                  : Factor w/ 4 levels "1","2","3","4": 1 1 2 1 1 1 1 1 4 2 ...
 $ Duration.of.Credit..month.       : int  18 9 12 12 12 10 8 6 18 24 ...
 $ Payment.Status.of.Previous.Credit: Factor w/ 5 levels "0","1","2","3",..: 5 5 3 5 5 5 5 5 5 3 ...
 $ Purpose                          : Factor w/ 10 levels "0","1","2","3",..: 3 1 9 1 1 1 1 1 4 4 ...
 ...

...and the rest of the code works fine:

library(caret)
set.seed(1000)
intrain <- createDataPartition(y=df$Creditability, p=0.7, list=FALSE)
train <- df[intrain, ]
test <- df[-intrain, ]

library("gbm")
df_boosting <- gbm(Creditability~., distribution = "bernoulli", 
       n.trees=100, verbose=TRUE, interaction.depth=4,
       shrinkage=0.01, data=train)
par(mar=c(3,14,1,1))
summary(df_boosting, las=2)

##########
                                                                var    rel.inf
Account.Balance                                     Account.Balance 36.8578980
Credit.Amount                                         Credit.Amount 12.0691120
Duration.of.Credit..month.               Duration.of.Credit..month. 10.5359895
Purpose                                                     Purpose 10.2691646
Payment.Status.of.Previous.Credit Payment.Status.of.Previous.Credit  9.1296524
Value.Savings.Stocks                           Value.Savings.Stocks  4.9620662
Instalment.per.cent                             Instalment.per.cent  3.3124252
...
##########

yhat.boost <- predict(df_boosting , newdata=test, n.trees=100)
mean((yhat.boost-test$Creditability)^2) 

[1] 0.2719788
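
Note that predict() on a gbm object returns values on the link scale (log-odds) by default, so the figure above is not a classification accuracy. Below is a minimal sketch of how accuracy could be computed, assuming a 0.5 probability cutoff. The toy vectors log_odds and actual are made up for illustration and stand in for yhat.boost and test$Creditability.

```r
# Hypothetical link-scale predictions (log-odds) and true 0/1 labels
log_odds <- c(1.2, -0.4, 0.3, 2.0, -1.5, 0.1)
actual   <- c(1, 0, 0, 1, 0, 1)

# Convert log-odds to probabilities with the logistic function
prob <- plogis(log_odds)

# Classify with a 0.5 cutoff (an assumption; other cutoffs may suit better)
pred <- ifelse(prob > 0.5, 1, 0)

# Accuracy: share of predictions that match the true labels
accuracy <- mean(pred == actual)

# Confusion matrix of predicted versus actual classes
conf_mat <- table(Predicted = pred, Actual = actual)

# Mean squared error on the probability scale (Brier score)
brier <- mean((prob - actual)^2)

print(accuracy)
print(conf_mat)
print(brier)
```

With the model above, predict(df_boosting, newdata=test, n.trees=100, type="response") should return probabilities directly, which would replace the plogis() step.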

Hope this helps.
