Is there a simple command to do leave-one-out cross-validation with the lm() function?

Time: 2017-10-31 09:06:05

Tags: r cross-validation lm

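Is there a simple command to do leave-one-out cross-validation with the lm() function in R?

Specifically, is there a simple command that does what the code below does?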

x <- rnorm(1000,3,2)
y <- 2*x + rnorm(1000)

pred_error_sq <- c(0)
for(i in 1:1000) {
  x_i <- x[-i]
  y_i <- y[-i]
  mdl <- lm(y_i ~ x_i) # leave i'th observation out
  y_pred <- predict(mdl, data.frame(x_i = x[i])) # predict i'th observation
  pred_error_sq <- pred_error_sq + (y[i] - y_pred)^2 # cumulate squared prediction errors
}

y_squared <- sum((y - mean(y))^2) # Total sum of squares, on the same scale as pred_error_sq

R_squared <- 1 - (pred_error_sq/y_squared) # Measure for goodness of fit

5 Answers:

Answer 0 (score: 7)

Another solution is to use caret:

library(caret)

x <- rnorm(1000, 3, 2) # create x first: data.frame() cannot see its own x column while building y
data <- data.frame(x = x, y = 2 * x + rnorm(1000))

train(y ~ x, method = "lm", data = data, trControl = trainControl(method = "LOOCV"))
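
Linear Regression

1000 samples
   1 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 999, 999, 999, 999, 999, 999, ...
Resampling results:

  RMSE      Rsquared  MAE
  1.050268  0.940619  0.836808

Tuning parameter 'intercept' was held constant at a value of TRUE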

Answer 1 (score: 2)

You can use a custom function that relies on a statistical trick to avoid actually fitting all N models.


This is explained here: {{3}}. It only works for linear models. I guess you might also want to take the square root of the mean in the formula.
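A minimal sketch of the kind of shortcut described above, assuming it refers to the standard leverage identity for ordinary least squares: the leave-one-out residual of observation i equals e_i / (1 - h_i), where e_i is the ordinary residual and h_i is the i-th diagonal element of the hat matrix. The function name `loocv_mse` and the simulated data are hypothetical and are not taken from the answer.

```r
# Hypothetical helper: LOOCV mean squared error from a single lm fit.
# Relies on the OLS identity  y_i - yhat_(-i) = e_i / (1 - h_i).
loocv_mse <- function(fit) {
  h <- hatvalues(fit)                 # leverages (diagonal of the hat matrix)
  mean((residuals(fit) / (1 - h))^2)  # mean of squared leave-one-out residuals
}

# Illustrative data only (assumed, mirrors the question's setup at a smaller n)
set.seed(1)
x <- rnorm(200, 3, 2)
y <- 2 * x + rnorm(200)

fit  <- lm(y ~ x)
mse  <- loocv_mse(fit)
rmse <- sqrt(mse)  # square root of the mean, comparable to an RMSE
```

Because only one model is fitted, the cost does not grow with the number of refits, but the identity holds for linear least squares only.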

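Answer 2 (score: 1)

You can try cv.lm from the DAAG package. Its m argument sets the number of folds, so setting m to the number of rows gives leave-one-out: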

cv.lm(data = DAAG::houseprices, form.lm = formula(sale.price ~ area),
              m = 3, dots = FALSE, seed = 29, plotit = c("Observed","Residual"),
              main="Small symbols show cross-validation predicted values",
              legend.pos="topleft", printit = TRUE)

Arguments

data         a data frame
form.lm      a formula or lm call or lm object
m            the number of folds
dots         uses pch=16 for the plotting character
seed         random number generator seed
plotit       This can be one of the text strings "Observed", "Residual", or a logical value. The logical TRUE is equivalent to "Observed", while FALSE is equivalent to "" (no plot)
main         main title for graph
legend.pos   position of legend: one of "bottomright", "bottom", "bottomleft", "left", "topleft", "top", "topright", "right", "center".
printit      if TRUE, output is printed to the screen

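Answer 3 (score: 0)

cv.glm from the boot package (https://www.rdocumentation.org/packages/boot/versions/1.3-20/topics/cv.glm) performs LOOCV by default and needs only the data and the fitted model. It expects a glm object; glm() with its default gaussian family fits the same model as lm().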
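As a hedged illustration of this approach, the sketch below assumes the boot package is available (it is one of R's recommended packages) and uses made-up simulated data. The variable names `df`, `loocv_mse`, and `loocv_r2` are hypothetical.

```r
library(boot)

# Illustrative data only (assumed)
set.seed(1)
x  <- rnorm(200, 3, 2)
df <- data.frame(x = x, y = 2 * x + rnorm(200))

# glm() with the default gaussian family gives the same fit as lm(y ~ x)
fit <- glm(y ~ x, data = df)

# K defaults to the number of rows, i.e. leave-one-out
cv <- cv.glm(df, fit)

loocv_mse <- cv$delta[1]  # raw cross-validated mean squared prediction error
loocv_r2  <- 1 - loocv_mse * nrow(df) / sum((df$y - mean(df$y))^2)
```

The default cost function is the average squared error, so `cv$delta[1]` is directly comparable to the mean of the squared prediction errors accumulated by the loop in the question.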

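Answer 4 (score: 0)

Just write your own code, using an index variable to mark the one observation held out of the sample. I tested this approach against the top-voted answer, which uses caret. Although caret is simple and easy to use, my brute-force approach takes less time. (Instead of lm I used LDA, but that makes no big difference.)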

for (index in seq_len(nrow(df))) {
   # fit your lm() on df[-index, ] here, then predict the held-out row df[index, ]
}
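One way the loop body could be filled in is sketched below. This is an assumption about what the answer intends, not the answerer's code (which used LDA). The data frame `df` with columns `x` and `y` is simulated for illustration.

```r
# Illustrative data only (assumed)
set.seed(1)
x  <- rnorm(200, 3, 2)
df <- data.frame(x = x, y = 2 * x + rnorm(200))

n       <- nrow(df)
cv_pred <- numeric(n)  # preallocate the held-out predictions

for (index in seq_len(n)) {
  fit <- lm(y ~ x, data = df[-index, ])  # fit without row `index`
  cv_pred[index] <- predict(fit, newdata = df[index, , drop = FALSE])
}

press      <- sum((df$y - cv_pred)^2)  # sum of squared leave-one-out errors
loocv_rmse <- sqrt(press / n)
loocv_r2   <- 1 - press / sum((df$y - mean(df$y))^2)
```

Storing each prediction rather than only a running total keeps the held-out values available for other metrics, such as MAE.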