In R

Time: 2018-09-12 09:37:40

Tags: r cross-validation

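I'm a beginner in statistics, and this is my first time building a regression model in R. I've already done a basic 70/30 split for a simple lm model, and now I'd like to write code for cross-validation without using the caret package. Here is my code: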

cv_ss <- model2[sample(nrow(model2)),]
folds <- cut(seq(1,nrow(cv_ss)),breaks=6, labels = FALSE)
for(i in 1:6){
  testIndexes <- which(folds==i,arr.ind = FALSE)
  testData <- cv_ss[testIndexes, ]
  trainData <- cv_ss[-testIndexes, ]
  model_cv_ss <- lm(trainData$ss ~., data = trainData)
  summary(model_cv_ss)
  pred_cv_ss<-predict(model_cv_ss,testData)
  actual_cv_ss<-testData[,"ss"]

  MAPE_ss = mean(abs((pred_cv_ss - actual_cv_ss)/actual_cv_ss))
  MAPE_ss

  error_ss = abs((pred_cv_ss- actual_cv_ss)/actual_cv_ss)
  error_ss
  save_ss<- rbind(save_ss, new)
}
new <- data.frame(MAPE_ss, error_ss)
#}
#save_ss<- rbind(save_ss, new)
save_ss

avg_error_ss <- mean(MAPE_ss)
avg_error_ss

Here are my results:

 MAPE_ss   error_ss
11  0.4012435 0.01960784
10  0.4012435 0.20384160
7   0.4012435 0.70386888
151  0.4012435 0.67765551
.
.
.

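I'm confused because: first, I only have 21 observations, yet the row labels in the output go up to 151, even though the global environment shows only 21 observations. Second, all of my MAPE_ss values are identical, which I don't think should be the case: each fold is fitted on different data, so MAPE_ss should differ from fold to fold.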
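A small sketch of where a label like 151 can come from, assuming the extra rows arise from repeated `rbind()` calls on data frames whose row names overlap. When `rbind()` combines data frames, duplicated row names are made unique by appending a number with no separator, so a second row named `15` would show up as `151`. The data frames `a` and `b` below are made up for illustration and are not taken from `model2`.

```r
# Two hypothetical data frames that both contain a row named "15"
a <- data.frame(x = c(0.1, 0.2), row.names = c("7", "15"))
b <- data.frame(x = c(0.3, 0.4), row.names = c("15", "20"))

combined <- rbind(a, b)

# The second "15" is renamed so that all row names stay unique
rownames(combined)
```

If that is what is happening here, the label 151 would be a repeated copy of row 15 rather than a 151st observation.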

Please correct me if I'm wrong.

Any help or advice would be greatly appreciated. Thanks in advance! :)

Edit: dim(model2)

> dim(model2)
[1] 25 13

dput(head(model2))

> dput(head(model2,20))
structure(list(ï..SST = c(0, 
0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1), SSA = c(0, 
0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), SSR = c(0, 
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0), SSC = c(0, 
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0), SSF = c(1, 
1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1), SSFH = c(0, 
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0), SSTC = c(1, 
1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0), SSH = c(1, 
1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1), SSC = c(1, 
1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1), SSW = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), QTY = c(109, 
45, 48, 49, 31, 81, 13, 48, 109, 45, 48, 49, 31, 81, 13, 48, 
36, 47, 58, 37), SS = c(53000, 47000, 5e+05, 
450000, 62000, 950000, 660000, 1060000, 530000, 480000, 520000, 
430000, 630000, 970000, 650000, 1090000, 1230374, 1695561, 981224, 
1130354), TTH. = c(60, 45, 45, 45, 45, 90, 90, 90, 
60, 45, 45, 45, 45, 95, 90, 90, 60, 90, 40, 65)), row.names = c(NA, 
20L), class = "data.frame")
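
For reference, a minimal sketch of a manual k-fold loop that keeps one MAPE per fold. It runs on made-up data rather than `model2`, and it assumes the response column is called `ss` as in the loop in the question (the `dput` output above shows `SS`, and R names are case-sensitive). The names `toy`, `k`, `fold_mape`, and `errors` are illustrative. Two things differ from the loop in the question: the formula is written as `ss ~ .` with `data = trainData`, because `trainData$ss ~ .` likely lets `ss` enter the model as a predictor as well; and each fold's result is stored inside the loop, whereas the question builds `new` only after the loop ends, so `rbind(save_ss, new)` inside the loop would reuse whatever `new` held from an earlier run.

```r
set.seed(1)  # reproducible toy data and shuffle

# Made-up data standing in for model2 (not the asker's data)
n <- 24
toy <- data.frame(x1 = runif(n, 10, 100), x2 = runif(n, 40, 95))
toy$ss <- 5000 + 300 * toy$x1 + 100 * toy$x2 + rnorm(n, sd = 500)

k <- 6
shuffled <- toy[sample(nrow(toy)), ]
folds <- cut(seq_len(nrow(shuffled)), breaks = k, labels = FALSE)

fold_mape <- numeric(k)       # one MAPE per fold
errors <- vector("list", k)   # per-observation errors, kept by fold

for (i in seq_len(k)) {
  testIndexes <- which(folds == i)
  testData  <- shuffled[testIndexes, ]
  trainData <- shuffled[-testIndexes, ]

  # Response named on the left-hand side, so "." excludes it
  fit  <- lm(ss ~ ., data = trainData)
  pred <- predict(fit, newdata = testData)

  ape <- abs((pred - testData$ss) / testData$ss)
  fold_mape[i] <- mean(ape)
  errors[[i]] <- data.frame(fold = i,
                            row = rownames(testData),
                            error = unname(ape))
}

save_ss <- do.call(rbind, errors)  # one row per held-out observation
fold_mape                          # k values, one per fold
mean(fold_mape)                    # average MAPE across folds
```

Collecting the per-fold pieces in a list and binding them once at the end also avoids carrying over rows from a previous run, which a growing `save_ss` object would do.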

0 answers:

No answers