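Using R, I want to randomly split a data frame into three smaller data frames. The first should contain 80% of the observations, and the second and third should contain 15% and 5% respectively. The three data frames must not overlap. Do you have any suggestions?
Answer 0 (score: 3)
Here is a quick function that splits the data into any number of groups, according to the values you supply in the `props` argument. It should be fairly self-explanatory.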
#' Splits data.frame into arbitrary number of groups
#'
#' @param dat The data.frame to split into groups
#' @param props Numeric vector. What proportion of the data should
#' go in each group?
#' @param which.adjust Numeric. Which group size should we 'fudge' to
#' make sure that we sample enough (or not too much)
split_data <- function(dat, props = c(.8, .15, .05), which.adjust = 1){
  # Make sure proportions are non-negative
  # and the adjustment group is a valid index into
  # the groups specified
  stopifnot(all(props >= 0), which.adjust >= 1, which.adjust <= length(props))
  # could check to see if the sum is 1
  # but this is easier
  props <- props/sum(props)
  n <- nrow(dat)
  # How large should each group be?
  ns <- round(n * props)
  # Rounding can leave sum(ns) larger or smaller than n,
  # so force the group specified in which.adjust to the
  # size that makes sum(ns) == n
  ns[which.adjust] <- n - sum(ns[-which.adjust])
  ids <- rep(seq_along(props), ns)
  # Shuffle ids so that the groups are randomized
  which.group <- sample(ids)
  split(dat, which.group)
}
split_data(mtcars)
split_data(mtcars, c(.7, .3))
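If you want to convince yourself that the pieces do not overlap, you can compare row names across the groups. The snippet below is only a sketch: it restates the same id-shuffling idea inline so that it runs on its own, and the `set.seed` call and the names `groups`, `sizes` and `all_names` are my own additions, not part of the function above.

```r
set.seed(42)                      # assumption: fixed seed, only for repeatability
n <- nrow(mtcars)                 # 32 rows
props <- c(.8, .15, .05)
ns <- round(n * props)            # 26, 5, 2 (sums to 33)
ns[1] <- n - sum(ns[-1])          # adjust the first group so sizes sum to n

# shuffle the group ids and split, as split_data() does
groups <- split(mtcars, sample(rep(seq_along(props), ns)))

sizes <- sapply(groups, nrow)     # group sizes: 25, 5 and 2
all_names <- unlist(lapply(groups, rownames))

# every row is used exactly once, so no row name appears twice
stopifnot(sum(sizes) == n, !anyDuplicated(all_names))
sizes
```

Note that with only 32 rows the realised shares are 25/32, 5/32 and 2/32, so the proportions are only approximately 80/15/5.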
Answer 1 (score: 2)
By manipulating the `parts` vector, you should be able to generate any number of unique sets:
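totrows <- nrow(dat)
rownos <- seq_len(totrows)
parts <- c(0.8, 0.15, 0.05)
sets <- vector(mode = "list", length = length(parts))
for (i in seq_along(parts)) {
  # number of rows for this set (share given by parts[i]),
  # rounded and capped at the number of rows still unused
  size <- min(round(parts[i] * totrows), length(rownos))
  # draw by position: sample(x) treats a single number x as 1:x,
  # which would go wrong when only one row number is left
  sets[[i]] <- rownos[sample.int(length(rownos), size)]
  # removing used row nos
  rownos <- setdiff(rownos, sets[[i]])
}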
If you want overlapping sets, you can remove the `setdiff` statement inside the loop.
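The loop gives you row numbers rather than data frames, so one more step is needed to get the three pieces the question asks for. A rough sketch, assuming `mtcars` stands in for `dat` and using a rounded, capped set size (my assumption about how fractional sizes should be handled); `pieces` is a name I made up:

```r
dat <- mtcars                     # assumption: stand-in for your data frame
set.seed(1)                       # assumption: fixed seed, only for repeatability
parts <- c(0.8, 0.15, 0.05)
totrows <- nrow(dat)
rownos <- seq_len(totrows)
sets <- vector(mode = "list", length = length(parts))
for (i in seq_along(parts)) {
  # round the target size and never ask for more rows than remain
  size <- min(round(parts[i] * totrows), length(rownos))
  sets[[i]] <- rownos[sample.int(length(rownos), size)]
  rownos <- setdiff(rownos, sets[[i]])
}

# turn each set of row numbers into a data frame
pieces <- lapply(sets, function(idx) dat[idx, , drop = FALSE])
sapply(pieces, nrow)              # 26, 5 and 1 rows for the 32 rows of mtcars
```

`drop = FALSE` keeps each piece a data frame even if `dat` has a single column.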