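Hadley Wickham proposed that bootstrapping can be done with the dplyr package. His suggestion was improved and then implemented in the broom package. Can k-fold cross-validation be implemented in the same way?
I think the first step (selecting the train groups) is quite simple: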
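    # Builds a grouped_df by hand; relies on the internal attributes that
    # older dplyr versions used to represent groups.
    crossvalidate <- function(df, k = 5) {
      n <- nrow(df)
      # Randomly assign each row to one of k folds
      idx <- sample(rep_len(1:k, n))
      # Group i holds the training rows of fold i (every row not in fold i)
      attr(df, "indices") <- lapply(1:k, function(i) which(idx != i))
      attr(df, "drop") <- TRUE
      attr(df, "group_sizes") <- nrow(df) - unclass(table(idx))
      attr(df, "biggest_group_size") <- max(attr(df, "group_sizes"))
      attr(df, "labels") <- data.frame(replicate = 1:k)
      attr(df, "vars") <- list(quote(replicate))
      class(df) <- c("grouped_df", "tbl_df", "tbl", "data.frame")
      df
    }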
But somehow I cannot find any documentation for attr(, "indices") that would tell me whether the "other" indices can be used to select the test groups. Do you have any ideas?
Answer 0 (score: 2)
https://rpubs.com/dgrtwo/cv-modelr has an example of k-fold cross-validation using the dplyr package together with modelr:
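    library(ISLR)
    library(dplyr)
    library(purrr)
    library(modelr)
    library(broom)
    library(tidyr)

    set.seed(1)

    # Split into 20 folds and fit a linear model on each training set
    models <- Smarket %>%
      select(Today, Lag1:Lag5) %>%
      crossv_kfold(k = 20) %>%
      mutate(model = map(train, ~ lm(Today ~ ., data = .)))

    # Predict on each held-out test set; the list column is created first
    # and then unnested, which is the form current tidyr versions expect
    predictions <- models %>%
      mutate(predicted = map2(model, test, ~ augment(.x, newdata = .y))) %>%
      unnest(predicted)

    # Compare the cross-validated MSE with an intercept-only baseline
    predictions %>%
      summarize(MSE = mean((Today - .fitted)^2),
                MSEIntercept = mean((Today - mean(Today))^2))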
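On the original question of how to get the test rows: they are the complement of the training rows for each fold. The following is a minimal base R sketch of that idea, not the modelr implementation. The helper name kfold_mse is hypothetical, and mtcars with the formula mpg ~ wt + hp is used only as stand-in data.

```r
# Minimal base R sketch of k-fold cross-validation.
# kfold_mse is a hypothetical helper name, not part of any package.
kfold_mse <- function(df, formula, k = 5) {
  n <- nrow(df)
  # Randomly assign each row to one of k folds of (nearly) equal size
  fold <- sample(rep_len(seq_len(k), n))
  response <- all.vars(formula)[1]
  sq_err <- numeric(n)
  for (i in seq_len(k)) {
    test_idx <- which(fold == i)   # rows held out in fold i
    train_idx <- which(fold != i)  # every other row
    fit <- lm(formula, data = df[train_idx, , drop = FALSE])
    pred <- predict(fit, newdata = df[test_idx, , drop = FALSE])
    sq_err[test_idx] <- (df[[response]][test_idx] - pred)^2
  }
  # Each row is predicted exactly once, so this is the cross-validated MSE
  mean(sq_err)
}

set.seed(1)
mse <- kfold_mse(mtcars, mpg ~ wt + hp, k = 5)
```

Because fold == i and fold != i partition the rows, no separate bookkeeping for the test set is needed.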
Answer 1 (score: 0)
Here is one solution for stratified 5-fold CV using dplyr:
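    # group_var is the stratification column; ID_var holds the name (as a
    # character string) of a column that uniquely identifies each row
    df_fold <- df %>%
      group_by(group_var) %>%
      sample_frac(1) %>%                           # shuffle rows within each stratum
      mutate(fold = rep(1:5, length.out = n())) %>%
      ungroup()

    for (i in 1:5) {
      val <- df_fold %>% filter(fold == i)         # validation rows for fold i
      tr <- df_fold %>% anti_join(val, by = ID_var) # training rows: everything else
      # fit on tr and evaluate on val here
    }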
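To make the loop concrete, here is a sketch of the same approach on the built-in iris data. The choices are assumptions made for illustration: Species plays the role of group_var, row_id is a hypothetical ID column added for the anti_join, and the linear model is an arbitrary example.

```r
library(dplyr)

set.seed(1)

# iris stands in for df; row_id is a hypothetical unique-ID column
iris_fold <- iris %>%
  mutate(row_id = row_number()) %>%
  group_by(Species) %>%
  sample_frac(1) %>%                            # shuffle rows within each species
  mutate(fold = rep(1:5, length.out = n())) %>%
  ungroup()

fold_mse <- numeric(5)
for (i in 1:5) {
  val <- iris_fold %>% filter(fold == i)
  tr <- iris_fold %>% anti_join(val, by = "row_id")
  # Arbitrary example model: predict sepal length from petal length
  fit <- lm(Sepal.Length ~ Petal.Length, data = tr)
  fold_mse[i] <- mean((val$Sepal.Length - predict(fit, newdata = val))^2)
}

mean(fold_mse)  # cross-validated MSE averaged over the 5 folds
```

Since the fold labels are assigned within each group, every fold gets the same number of rows per species, which is what makes the split stratified.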