I am new to R and am trying to run a logistic regression with five-fold cross-validation on the following data set, for prediction purposes:
> str(data_known)
'data.frame': 91708 obs. of 16 variables:
$ order_item_id : int 1 2 3 4 5 6 7 8 11 12 ...
$ order_date : Factor w/ 12 levels "2012-04","2012-05",..: 6 8
10 5 6 12 10 3 10 12 ...
$ item_id : int 1507 1745 2588 164 1640 2378 1506 224 2640
2624 ...
$ item_size : Factor w/ 99 levels "1","10","10+",..: 95 2 98 53
92 47 95 53 95 53 ...
$ brand_id : int 102 64 42 47 97 72 102 58 41 12 ...
$ item_price : num 24.9 75 79.9 79.9 69.9 ...
$ user_id : int 46943 60979 72232 41242 8810 15761 64795
23489 69092 23261 ...
$ user_title : Factor w/ 5 levels "Company","Family",..: 4 4 4 4
4 4 4 4 4 4 ...
$ user_dob : Factor w/ 607 levels "1943-01","1943-02",..: 263
368 80 216 342 274 NA 264 274 246 ...
$ user_state : Factor w/ 16 levels "Baden-Wuerttemberg",..: 11 4
12 16 1 10 13 10 13 2 ...
$ user_reg_date : Factor w/ 26 levels "2011-02","2011-03",..: 1 4
24 19 12 1 23 1 24 1 ...
$ delivery_time_days: Factor w/ 18 levels "0","1","2","3",..: 3 5 3 6 4
12 5 5 11 4 ...
$ user_title_NA : num 0 0 0 0 0 0 0 0 0 0 ...
$ item_size_NA : num 0 0 0 0 0 0 0 0 0 0 ...
$ user_dob_NA : num 0 0 0 0 0 0 1 0 0 0 ...
$ target : Factor w/ 2 levels "Return","No Return": 1 2 1 1
1 1 2 1 2 1 ...
For the regression, I first tried a solution using the caret package:
train_control <- trainControl(method = "cv", number = 5)
model.lr.cv <- train(formula = data_known$target ~ ., data = data_known,
                     trControl = train_control, method = "glm",
                     family = binomial(link = "logit"))
and received the error message:
"Error in na.fail.default(list(order_item_id = c(1L, 2L, 3L, 4L, 5L,
6L, : missing values in object"
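As far as I can tell, this message comes from the default `na.action = na.fail` in the formula interface of `train`, which stops as soon as any row contains an NA (my `user_dob` column does contain NAs, as the `str` output shows). A minimal sketch of how the missing values can be counted per column, using a small made-up data frame (`toy` is hypothetical and is not my real data):

```r
# Hypothetical toy data standing in for data_known; user_dob contains one NA
toy <- data.frame(
  item_price = c(24.9, 75.0, 79.9, 69.9),
  user_dob   = factor(c("1943-01", NA, "1943-02", "1943-01")),
  target     = factor(c("Return", "No Return", "Return", "Return"))
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows that would remain if incomplete rows were dropped (what na.omit does)
toy_complete <- na.omit(toy)
print(nrow(toy_complete))
```

Whether dropping such rows is acceptable for prediction is a separate question; the sketch only shows where the NAs sit.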
I then tried to work around the problem by running glm in a for loop, which reads:
library(plyr)  # provides ldply()

# all columns except the 16th (target) are explanatory variables
explanatory_variables.lr.cv <- names(data_known)[-c(16)]
form.lr.cv <- as.formula(paste("target ~",
                               paste(explanatory_variables.lr.cv, collapse = "+")))
# randomly split the rows into 5 folds
folds.lr.cv <- split(data_known, cut(sample(1:nrow(data_known)), 5))
list_models.lr.cv <- list()
list_raw_pred.lr.cv <- list()
for (i in seq_along(folds.lr.cv)) {
  # fold i is held out for testing, the remaining folds are used for training
  test.lr.cv <- ldply(folds.lr.cv[i], data.frame)
  train.lr.cv <- ldply(folds.lr.cv[-i], data.frame)
  tmp.model.lr.cv <- glm(form.lr.cv, data = train.lr.cv,
                         family = binomial(link = "logit"))
  list_models.lr.cv[[i]] <- tmp.model.lr.cv
  tmp.predict.lr.cv <- predict(tmp.model.lr.cv, newdata = test.lr.cv,
                               type = "response")
  list_raw_pred.lr.cv[[i]] <- tmp.predict.lr.cv
}
But then the following error occurs:
"Error in model.frame.default(Terms, newdata, na.action = na.action,
xlev = object$xlevels) : Factor 'item_size' has new levels 45+"
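My understanding of this message is that a level of `item_size` (here `45+`) occurs only in the held-out fold, so the model fitted on the remaining folds has never seen it. A minimal sketch on made-up data that checks for such levels before predicting (`train_toy`, `test_toy` and `y` are hypothetical names, not part of my data):

```r
# Hypothetical toy folds: the level "45+" appears only in the test part
train_toy <- data.frame(
  item_size = factor(c("1", "10", "1", "10", "1", "10")),
  y         = c(0, 1, 1, 0, 0, 1)
)
test_toy <- data.frame(item_size = factor(c("1", "45+")))

fit <- glm(y ~ item_size, data = train_toy, family = binomial(link = "logit"))

# Levels present in the test fold that were not seen when fitting
unseen <- setdiff(levels(test_toy$item_size), fit$xlevels$item_size)
print(unseen)

# Predict only for rows whose level was seen; leave the others as NA
ok <- !(test_toy$item_size %in% unseen)
pred <- rep(NA_real_, nrow(test_toy))
pred[ok] <- predict(fit,
                    newdata = droplevels(test_toy[ok, , drop = FALSE]),
                    type = "response")
print(pred)
```

This only sidesteps the error by leaving the affected rows unpredicted; it is one possible workaround, not necessarily the right way to handle rare factor levels.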
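Since the two errors seem to be unrelated, I am not sure what exactly the problem is, let alone how to solve it. By the way, a "normal" glm without cross-validation works fine.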
Thank you very much for your help.
Best,
Nico