Question

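I am trying to fit a logistic ridge regression and have developed the model below. I need help writing code to test its accuracy and to plot the ROC/AUC curve with thresholds.

My code is as follows:

Fitting the model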

library(glmnet)
library(caret)

data1<-read.csv("D:\\Research\\Final2.csv",header=T,sep=",")
str(data1)
'data.frame':   154 obs. of  12 variables:
 $ Earningspershare : num  12 2.69 8.18 -0.91 3.04 ...
 $ NetAssetsPerShare: num  167.1 17.2 41.1 14.2 33 ...
 $ Dividendpershare : num  3 1.5 1.5 0 1.25 0 0 0 0 0.5 ...
 $ PE               : num  7.35 8.85 6.66 -5.27 18.49 ...
 $ PB               : num  0.53 1.38 1.33 0.34 1.7 0.23 0.5 3.1 0.5 0.3 ...
 $ ROE              : num  0.08 0.16 0.27 -0.06 0.09 -0.06 -0.06 0.15 0.09 0.
 $ ROA              : num  0.02 0.09 0.14 -0.03 0.05 -0.04 -0.05 0.09 0.03 0
 $ Log_MV           : num  8.65 10.38 9.81 8.3 10.36 ..
 $ Return_yearly    : int  0 1 0 0 0 0 0 0 0 0 ...
 $ L3               : int  0 0 0 0 0 0 0 0 0 0 ...
 $ L6               : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Sector           : int  2 2 2 2 2 1 2 2 4 1 ...

smp_size <- floor(0.8 * nrow(data1))
set.seed(123)
train_ind <- sample(seq_len(nrow(data1)), size = smp_size)
train <- data1[train_ind, ]
test <- data1[-train_ind, ]

train$Return_yearly <-as.factor(train$Return_yearly)
train$L3 <-as.factor(train$L3)
train$L6 <-as.factor(train$L6)
train$Sector <-as.factor(train$Sector)

train$L3 <-model.matrix( ~ L3 - 1, data=train)
train$L6 <-model.matrix( ~ L6 - 1, data=train)
train$Sector<-model.matrix( ~ Sector - 1, data=train)

x <- model.matrix(Return_yearly ~., train)
y <- train$Return_yearly

ridge.mod <- glmnet(x, y=as.factor(train$Return_yearly), family='binomial', alpha=0, nlambda=100, lambda.min.ratio=0.0001)

set.seed(1)
cv.out <- cv.glmnet(x, y=as.factor(train$Return_yearly), family='binomial', alpha=0, nfolds = 5, type.measure = "auc", nlambda=100, lambda.min.ratio=0.0001)
plot(cv.out)
best.lambda <- cv.out$lambda.min
best.lambda
[1] 5.109392

Testing the model

test$L3 <-as.factor(test$L3)
test$L6 <-as.factor(test$L6)
test$Sector <-as.factor(test$Sector)
test$Return_yearly <-as.factor(test$Return_yearly)

test$L3 <-model.matrix( ~ L3 - 1, data=test)
test$L6 <-model.matrix( ~ L6 - 1, data=test)
test$Sector<-model.matrix( ~ Sector - 1, data=test)

newx <- model.matrix(Return_yearly ~., test)
y.pred <- as.matrix(ridge.mod,newx=newx, type="class",data=test)

Comparing predictions with the actual values for the accuracy test; this raises a warning and I cannot continue:

compare <- cbind (actual=test$Return_yearly, y.pred)
Warning message:
   In cbind(actual = test$Return_yearly, y.pred) :
   number of rows of result is not a multiple of vector length (arg 1)

Answer 1

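Without a reproducible dataset, this is only a guess:

Because L3 and L6 are converted to factors separately in each split, the train and test matrices end up with different columns. By default, as.factor() creates exactly as many levels as there are unique values in the vector, so if the train/test split leaves the two parts with different unique values of L3 or L6, the number of dummy variables created by model.matrix() will differ as well.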
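The mismatch described above can be reproduced on a toy example. This is a minimal sketch with made-up values; the data frames `toy_train` and `toy_test` are hypothetical, and `Sector` is used here only because it has more than two codes in the question's data.

```r
# Hypothetical toy split: sector 4 happens to land only in the training rows
toy_train <- data.frame(Sector = as.factor(c(1, 2, 4, 2)))
toy_test  <- data.frame(Sector = as.factor(c(1, 2, 2)))

# Each factor only knows the values present in its own split
levels(toy_train$Sector)  # "1" "2" "4"
levels(toy_test$Sector)   # "1" "2"

# So the dummy matrices have a different number of columns
dummies_train <- model.matrix(~ Sector - 1, data = toy_train)
dummies_test  <- model.matrix(~ Sector - 1, data = toy_test)
ncol(dummies_train)  # 3
ncol(dummies_test)   # 2

# Giving the test factor the full set of levels restores the missing column
toy_test$Sector    <- factor(toy_test$Sector, levels = levels(toy_train$Sector))
dummies_test_fixed <- model.matrix(~ Sector - 1, data = toy_test)
colnames(dummies_test_fixed)  # "Sector1" "Sector2" "Sector4"
```

A model fitted on three dummy columns cannot score a matrix that only has two, which is why the columns have to agree before predicting.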

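Possible solutions: call as.factor() before the train/test split, or build the factor with factor() and pass it the full set of levels (as.factor() itself does not take a levels argument), e.g.

train$L3 <- factor(train$L3, levels = unique(data1$L3))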
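Once train and test share the same columns, the accuracy and ROC/AUC parts of the question could be approached roughly as follows. This is a sketch under several assumptions, not the asker's actual pipeline: the data are simulated (`dat` is a hypothetical stand-in for `data1`), the pROC package is an added dependency that the question does not use, and the predictions come from predict() on the cross-validated fit, whereas the question's `y.pred` line calls as.matrix(), which does not produce predictions.

```r
library(glmnet)
library(pROC)   # assumption: pROC is installed; it is not used in the question

set.seed(123)
n <- 154
# Simulated stand-in for data1 (hypothetical values, not the asker's file)
dat <- data.frame(
  ROE    = rnorm(n),
  PB     = rnorm(n),
  Sector = factor(sample(1:4, n, replace = TRUE))
)
dat$Return_yearly <- factor(rbinom(n, 1, plogis(0.8 * dat$ROE - 0.5 * dat$PB)))

# Build the design matrix once on the full data so both splits share columns
X <- model.matrix(Return_yearly ~ ., dat)[, -1]   # drop the intercept column
y <- dat$Return_yearly

train_ind <- sample(seq_len(n), size = floor(0.8 * n))
x_train <- X[train_ind, ];  y_train <- y[train_ind]
x_test  <- X[-train_ind, ]; y_test  <- y[-train_ind]

# Ridge (alpha = 0) logistic regression with lambda chosen by cross-validated AUC
cv.out <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 0,
                    nfolds = 5, type.measure = "auc")

# predict() returns one row per test observation at the chosen lambda
prob <- as.numeric(predict(cv.out, newx = x_test, s = "lambda.min", type = "response"))
pred <- as.vector(predict(cv.out, newx = x_test, s = "lambda.min", type = "class"))

# Confusion table and accuracy at the default 0.5 cut-off
print(table(actual = y_test, predicted = pred))
accuracy <- mean(pred == as.character(y_test))
print(accuracy)

# ROC curve, AUC, and the threshold that maximises Youden's J
roc_obj <- roc(response = y_test, predictor = prob,
               levels = c("0", "1"), direction = "<")
print(auc(roc_obj))
plot(roc_obj)
print(coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity")))
```

With predictions of the same length as `y_test`, a cbind() of actual and predicted values no longer triggers the "not a multiple of vector length" warning. On the real data, the threshold would ideally be chosen on the training or validation folds rather than on the test set.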

Logistic ridge regression in R: code for prediction, ROC/AUC, and accuracy testing
