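My goal is to plot the bias-variance decomposition of cubic smoothing splines for varying degrees of freedom.
First, I simulate a test set (matrix) and a training set (matrix). Then I iterate over 100 simulations and, within each one, vary the degrees of freedom of the smoothing spline.
The output from the code below does not show any tradeoff. What am I doing wrong when computing the bias and variance?
For reference, the right panel of this figure (slide 14) shows the tradeoff I expect (source).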
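For reference, the decomposition being estimated can be written as follows. This is the standard textbook form under squared-error loss, not something taken from the linked slides. The notation is an assumption introduced here: x_0 is a fixed test point, y_0 = f(x_0) + epsilon its noisy response, \hat f the fitted spline, and \sigma^2 the noise variance (2 in the simulation below, since the noise has standard deviation sqrt(2)). Expectations are taken over repeated draws of the training noise.

```latex
\mathbb{E}\left[\left(y_0 - \hat f(x_0)\right)^2\right]
  = \sigma^2
  + \left(\mathbb{E}\left[\hat f(x_0)\right] - f(x_0)\right)^2
  + \operatorname{Var}\left(\hat f(x_0)\right)
```

The squared bias and the variance in this form are both defined at a single test point, across simulations, and are only afterwards averaged over test points.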
rm(list = ls())
library(SimDesign)
set.seed(123)
n_sim <- 100
n_df <- 40
n_sample <- 100
mse_temp <- matrix(NA, nrow = n_sim, ncol = n_df)
var_temp <- matrix(NA, nrow = n_sim, ncol = n_df)
bias_temp <- matrix(NA, nrow = n_sim, ncol = n_df)
# Train data -----
x_train <- runif(n_sample, -0.5, 0.5)
f_train <- 0.8*x_train+sin(6*x_train)
epsilon_train <- replicate(n_sim, rnorm(n_sample,0,sqrt(2)))
y_train <- replicate(n_sim,f_train) + epsilon_train
# Test data -----
x_test <- runif(n_sample, -0.5, 0.5)
f_test <- 0.8*x_test+sin(6*x_test)
epsilon_test <- replicate(n_sim, rnorm(n_sample,0,sqrt(2)))
y_test <- replicate(n_sim,f_test) + epsilon_test
for (mc_iter in seq(n_sim)){
  for (df_iter in seq(n_df)){
    cspline <- smooth.spline(x_train, y_train[,mc_iter], df=df_iter+1)
    cspline_predict <- predict(cspline, x_test)
    mse_temp[mc_iter, df_iter] <- mean((y_test[,mc_iter] - cspline_predict$y)^2)
    var_temp[mc_iter, df_iter] <- var(cspline_predict$y)
    # bias_temp[mc_iter, df_iter] <- bias(cspline_predict$y, f_test)^2
    bias_temp[mc_iter, df_iter] <- mean((replicate(n_sample, mean(cspline_predict$y))-f_test)^2)
  }
}
mse_spline <- apply(mse_temp, 2, FUN = mean)
var_spline <- apply(var_temp, 2, FUN = mean)
bias_spline <- apply(bias_temp, 2, FUN = mean)
par(mfrow=c(1,3))
plot(seq(n_df),mse_spline, type = 'l')
plot(seq(n_df),var_spline, type = 'l')
plot(seq(n_df),bias_spline, type = 'l')
Answer 0 (score: 0)
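Actually, I think your code works. The sample size is just small, so you reach the overfitting regime very quickly, and everything of interest in the plots is squeezed against the left edge, in the region of only a few degrees of freedom. If you increase n_sample, you should see the expected relationship.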
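One way to try this is to wrap the simulation in a function of n_sample, so the same experiment can be rerun at different sample sizes. The sketch below is not the original poster's code: bias_var_spline is a hypothetical helper, and it assumes the usual pointwise definitions, taking the mean and variance of the predictions across simulated training sets at each test point and then averaging over test points. The question's code instead calls var() across test points within a single fit, so the two will not give the same numbers. The sketch also redraws the training noise for every df value rather than reusing one set of draws.

```r
# Hypothetical helper (not from the original post): Monte Carlo estimates of
# squared bias and variance of smooth.spline fits for several df values.
bias_var_spline <- function(n_sample, dfs, n_sim = 100, sigma = sqrt(2)) {
  f <- function(x) 0.8 * x + sin(6 * x)   # true function from the question
  x_train <- runif(n_sample, -0.5, 0.5)   # training design, held fixed
  x_test  <- runif(n_sample, -0.5, 0.5)   # test points
  f_test  <- f(x_test)

  out <- data.frame(df = dfs, bias2 = NA_real_, variance = NA_real_)
  for (j in seq_along(dfs)) {
    # n_sample x n_sim matrix: one column of test-point predictions
    # per simulated training response
    pred <- replicate(n_sim, {
      y_train <- f(x_train) + rnorm(n_sample, 0, sigma)
      fit <- smooth.spline(x_train, y_train, df = dfs[j])
      predict(fit, x_test)$y
    })
    # Bias: average prediction across simulations vs. truth, per test point
    out$bias2[j] <- mean((rowMeans(pred) - f_test)^2)
    # Variance: spread of predictions across simulations, per test point
    out$variance[j] <- mean(apply(pred, 1, var))
  }
  out
}

set.seed(123)
res_small <- bias_var_spline(n_sample = 100, dfs = 2:20)
res_large <- bias_var_spline(n_sample = 500, dfs = 2:20)
```

Plotting res_small$bias2 and res_small$variance against res_small$df, and the same for res_large, gives a side-by-side view of how the curves change with the sample size.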