In R

Time: 2017-11-26 14:15:27

Tags: r for-loop logistic-regression cross-validation r-caret

I am new to R and am trying to run a logistic regression with five-fold cross-validation on the following data set, for prediction purposes:

> str(data_known)
'data.frame':   91708 obs. of  16 variables:
$ order_item_id     : int  1 2 3 4 5 6 7 8 11 12 ...
$ order_date        : Factor w/ 12 levels "2012-04","2012-05",..: 6 8 10 5 6 12 10 3 10 12 ...
$ item_id           : int  1507 1745 2588 164 1640 2378 1506 224 2640 2624 ...
$ item_size         : Factor w/ 99 levels "1","10","10+",..: 95 2 98 53 92 47 95 53 95 53 ...
$ brand_id          : int  102 64 42 47 97 72 102 58 41 12 ...
$ item_price        : num  24.9 75 79.9 79.9 69.9 ...
$ user_id           : int  46943 60979 72232 41242 8810 15761 64795 23489 69092 23261 ...
$ user_title        : Factor w/ 5 levels "Company","Family",..: 4 4 4 4 4 4 4 4 4 4 ...
$ user_dob          : Factor w/ 607 levels "1943-01","1943-02",..: 263 368 80 216 342 274 NA 264 274 246 ...
$ user_state        : Factor w/ 16 levels "Baden-Wuerttemberg",..: 11 4 12 16 1 10 13 10 13 2 ...
$ user_reg_date     : Factor w/ 26 levels "2011-02","2011-03",..: 1 4 24 19 12 1 23 1 24 1 ...
$ delivery_time_days: Factor w/ 18 levels "0","1","2","3",..: 3 5 3 6 4 12 5 5 11 4 ...
$ user_title_NA     : num  0 0 0 0 0 0 0 0 0 0 ...
$ item_size_NA      : num  0 0 0 0 0 0 0 0 0 0 ...
$ user_dob_NA       : num  0 0 0 0 0 0 1 0 0 0 ...
$ target            : Factor w/ 2 levels "Return","No Return": 1 2 1 1 1 1 2 1 2 1 ...

For the regression, I first tried the caret solution:

library(caret)  # provides trainControl() and train()

train_control <- trainControl(method = "cv", number = 5)
model.lr.cv <- train(formula = data_known$target ~ ., data = data_known,
                     trControl = train_control, method = "glm",
                     family = binomial(link = "logit"))

and received this error message:

"Error in na.fail.default(list(order_item_id = c(1L, 2L, 3L, 4L, 5L, 6L,  : missing values in object"
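
The message can be reproduced on a small example. The sketch below assumes that the NA entries visible in user_dob in the str() output are what na.fail objects to. The data frame toy and its values are made up for illustration and are not the real data. It appears that the formula interface of caret's train() defaults to na.action = na.fail, while glm() uses getOption("na.action"), which is na.omit unless it has been changed. That difference would explain why a plain glm() fit can succeed on the same data.

```r
# Toy stand-in for data_known (hypothetical values, not the real data).
toy <- data.frame(
  item_price = c(24.9, 75.0, 79.9, 69.9, 19.9, 49.9),
  user_dob   = factor(c("1980-01", NA, "1975-06", "1990-03", NA, "1985-11")),
  target     = factor(c("Return", "No Return", "Return",
                        "Return", "No Return", "No Return"))
)

# Count missing values per column to see which variables contain NAs.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Building a model frame with na.action = na.fail stops as soon as any NA is present.
na_msg <- tryCatch(
  {
    model.frame(target ~ ., data = toy, na.action = na.fail)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(na_msg)

# One blunt way past the check: keep only rows without any missing value.
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

Dropping incomplete rows discards data, so imputing the missing values or passing a different na.action may be preferable, depending on how the NAs arose.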

I then tried to work around the problem by running glm in a for loop, which reads:

library(plyr)  # provides ldply()

# Build the formula from every column except the target (column 16).
explanatory_variables.lr.cv <- names(data_known)[-c(16)]
form.lr.cv <- as.formula(paste("target ~",
                               paste(explanatory_variables.lr.cv, collapse = "+")))

# Randomly split the rows into five folds.
folds.lr.cv <- split(data_known, cut(sample(1:nrow(data_known)), 5))
list_models.lr.cv <- list()
list_raw_pred.lr.cv <- list()

for (i in 1:length(folds.lr.cv)) {
  # Fold i is the test set; the remaining folds form the training set.
  test.lr.cv <- ldply(folds.lr.cv[i], data.frame)
  train.lr.cv <- ldply(folds.lr.cv[-i], data.frame)
  tmp.model.lr.cv <- glm(form.lr.cv, data = train.lr.cv,
                         family = binomial(link = "logit"))
  list_models.lr.cv[[i]] <- tmp.model.lr.cv
  tmp.predict.lr.cv <- predict(tmp.model.lr.cv, newdata = test.lr.cv,
                               type = "response")
  list_raw_pred.lr.cv[[i]] <- tmp.predict.lr.cv
}

But then the following error appeared:

"Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : Factor 'item_size' has new levels 45+"
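
A likely reading of this message is that a rare level of item_size (here "45+") ended up only in the test fold, so the model fitted on the training folds has no coefficient for it. The sketch below reproduces that situation on made-up data. The data frame toy, its values, and the train/test split are assumptions chosen for illustration. It then shows one possible workaround, which is to predict only for test rows whose level was seen during fitting.

```r
# Toy data (hypothetical): the level "45+" occurs exactly once, in row 13.
toy <- data.frame(
  item_size  = factor(c(rep(c("M", "L"), each = 6), "45+", "M", "L")),
  item_price = c(10, 20, 30, 40, 50, 60, 15, 25, 35, 45, 55, 65, 70, 22, 48),
  target     = factor(c("Return", "No Return", "Return", "No Return",
                        "No Return", "Return", "No Return", "Return",
                        "Return", "No Return", "Return", "No Return",
                        "Return", "No Return", "Return"))
)
train <- toy[1:12, ]   # contains only "M" and "L"
test  <- toy[13:15, ]  # contains the unseen level "45+"

fit <- glm(target ~ item_size + item_price, data = train,
           family = binomial(link = "logit"))

# Predicting on the full test set fails because "45+" was never seen in training.
err_msg <- tryCatch(
  {
    predict(fit, newdata = test, type = "response")
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(err_msg)

# Workaround: keep only test rows whose level the fitted model knows about.
seen    <- fit$xlevels$item_size
test_ok <- droplevels(test[test$item_size %in% seen, ])
pred    <- predict(fit, newdata = test_ok, type = "response")
print(length(pred))
```

Filtering leaves the affected rows without a prediction. Merging rare levels into a common category before splitting, or building the folds so that every level appears in each training set, are alternatives that keep all rows.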

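Since the two errors seem to be unrelated, I am not sure what exactly the problem is, let alone how to fix it. A "normal" glm without cross-validation works fine, by the way.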

Thank you very much for your help.

Best,

Nico

0 answers:

No answers