k-fold cross-validation in dplyr?

Time: 2016-11-23 13:37:42

Tags: r dplyr

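Hadley Wickham proposed a way to do bootstrapping with the dplyr package; his suggestion was improved and then implemented in the broom package. Is it also possible to implement k-fold cross-validation?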

I guess the first step (selecting the train group) is pretty simple:

crossvalidate <- function (df, k = 5) {
  n <- nrow(df)
  idx <- sample(rep_len(1:k, n))
  attr(df, "indices") <- lapply(1:k, function(i) which(idx != i))
  attr(df, "drop") <- TRUE
  attr(df, "group_sizes") <- nrow(df) - unclass(table(idx))
  attr(df, "biggest_group_size") <- max(attr(df, "group_sizes"))
  attr(df, "labels") <- data.frame(replicate = 1:k)
  attr(df, "vars") <- list(quote(replicate))
  class(df) <- c("grouped_df", "tbl_df", "tbl", "data.frame")
  df
}

But somehow I can't find any documentation of attr(, "indices") that would tell me whether it is possible to use the "other" indices to select the test group. Do you have any ideas?

2 answers:

Answer 0: (score: 2)

https://rpubs.com/dgrtwo/cv-modelr has an example of k-fold cross-validation using the dplyr package:

library(ISLR)
library(dplyr)
library(purrr)
library(modelr)
library(broom)
library(tidyr)

set.seed(1)

models <- Smarket %>%
  select(Today, Lag1:Lag5) %>%
  crossv_kfold(k = 20) %>%
  mutate(model = map(train, ~ lm(Today ~ ., data = .)))

predictions <- models %>%
  unnest(map2(model, test, ~ augment(.x, newdata = .y)))

predictions %>%
  summarize(MSE = mean((Today - .fitted) ^ 2),
            MSEIntercept = mean((Today - mean(Today))^2))
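The train and test columns that crossv_kfold returns are complementary sets of rows, which also relates to the which(idx != i) rule in the question: the held-out rows are simply which(idx == i). Below is a minimal base-R sketch of that idea. It is not modelr's implementation; kfold_indices is a hypothetical helper, and mtcars with lm(mpg ~ wt) are stand-ins chosen only so the example runs on its own.

```r
# Hypothetical helper (not part of dplyr or modelr): returns, for each fold,
# the row indices of the training set and of the held-out test set.
kfold_indices <- function(n, k = 5) {
  # Assign each of the n rows to one of k folds, in random order
  idx <- sample(rep_len(seq_len(k), n))
  lapply(seq_len(k), function(i) {
    list(train = which(idx != i),  # same rule as in the question
         test  = which(idx == i))  # the complement: rows held out in fold i
  })
}

set.seed(1)
folds <- kfold_indices(nrow(mtcars), k = 5)

# Fit on the training rows and score on the held-out rows of each fold
mse <- vapply(folds, function(f) {
  fit  <- lm(mpg ~ wt, data = mtcars[f$train, ])
  pred <- predict(fit, newdata = mtcars[f$test, ])
  mean((mtcars$mpg[f$test] - pred)^2)
}, numeric(1))

mean(mse)  # cross-validated estimate of the MSE
```

Each row appears in exactly one test set, so the per-fold errors can be averaged into a single estimate.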

Answer 1: (score: 0)

Here is one solution for stratified 5-fold CV using dplyr:

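# group_var and ID_var are placeholders: group_var is the stratification column,
# ID_var a character string naming a column that uniquely identifies each row
df_fold = df %>%
  group_by(group_var) %>%
  sample_frac(1) %>%                           # shuffle rows within each stratum
  mutate(fold=rep(1:5, length.out=n())) %>%    # assign folds 1..5 within each stratum
  ungroup()

for(i in 1:5){
  val = df_fold %>% filter(fold==i)            # validation set: rows in fold i
  tr = df_fold %>% anti_join(val, by=ID_var)   # training set: all remaining rows
  # fit the model on tr and evaluate it on val here
}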
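The loop above builds the validation and training sets but leaves the model fitting open. A possible way to complete it is sketched below, under stated assumptions: mtcars stands in for df, cyl for group_var, and an added id column for ID_var, with lm(mpg ~ wt) as an arbitrary example model. None of these choices come from the answer itself.

```r
library(dplyr)

set.seed(1)

# Illustrative data: `cyl` plays the role of group_var and `id` the role of
# ID_var (both are assumptions made for this sketch)
df <- mtcars %>% mutate(id = row_number())

df_fold <- df %>%
  group_by(cyl) %>%
  sample_frac(1) %>%                               # shuffle rows within each stratum
  mutate(fold = rep(1:5, length.out = n())) %>%    # deal rows into folds 1..5
  ungroup()

mse <- numeric(5)
for (i in 1:5) {
  val <- df_fold %>% filter(fold == i)             # held-out rows for fold i
  tr  <- df_fold %>% anti_join(val, by = "id")     # everything else
  fit <- lm(mpg ~ wt, data = tr)
  mse[i] <- mean((val$mpg - predict(fit, newdata = val))^2)
}

mean(mse)  # average validation MSE across the 5 folds
```

Because folds are assigned within each cyl group, every fold contains rows from every stratum, which is the point of stratifying.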