How to randomly split a data frame into three smaller data frames with a given number of rows

Time: 2013-11-18 06:13:16

Tags: r dataframe

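Using R, I want to randomly split a data frame into three smaller data frames. The first should hold 80% of the observations, and the second and third should hold 15% and 5% respectively. The three data frames must not overlap. Do you have any suggestions?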

2 answers:

Answer 0 (score: 3)

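Here is a quick function that splits a data frame into an arbitrary number of groups, based on the proportions you pass in the 'props' argument. It should be fairly self-explanatory.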

#' Splits data.frame into arbitrary number of groups
#' 
#' @param dat The data.frame to split into groups
#' @param props Numeric vector. What proportion of the data should
#'              go in each group?
#' @param which.adjust Numeric. Which group size should we 'fudge' to
#'              make sure that we sample enough (or not too much)
split_data <- function(dat, props = c(.8, .15, .05), which.adjust = 1){

    # Make sure proportions are non-negative
    # and that which.adjust is a valid index into props
    stopifnot(all(props >= 0), which.adjust >= 1, which.adjust <= length(props))

    # could check to see if the sum is 1
    # but this is easier
    props <- props/sum(props)
    n <- nrow(dat)
    # How large should each group be?
    ns <- round(n * props)
    # The previous step might give something that
    # gives sum(ns) > n so let's force the group
    # specified in which.adjust to be a value that
    # makes it so that sum(ns) = n
    ns[which.adjust] <- n - sum(ns[-which.adjust])

    ids <- rep(1:length(props), ns)
    # Shuffle ids so that the groups are randomized
    which.group <- sample(ids)
    split(dat, which.group)
}

split_data(mtcars)
split_data(mtcars, c(.7, .3))
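
To see why the which.adjust step matters, here is a small sketch of the rounding arithmetic for the 32 rows of mtcars with the default proportions. It only replays the size calculation from the function above; nothing else is assumed.

```r
n <- nrow(mtcars)              # 32 rows
props <- c(.8, .15, .05)

# Rounding each share independently can overshoot the total:
# 32 * c(.8, .15, .05) is 25.6, 4.8, 1.6, which rounds to 26, 5, 2
ns <- round(n * props)
sum(ns)                        # 33, one more than the 32 rows available

# Let group 1 absorb the difference so the sizes add up to n
ns[1] <- n - sum(ns[-1])
ns                             # 25 5 2
```

Without that correction, rep() would produce 33 group labels for 32 rows, and split() would not line up with the data.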

Answer 1 (score: 2)

By changing the parts vector, you should be able to generate any number of non-overlapping sets of row numbers:

totrows <- nrow(dat)
rownos <- seq(totrows)
parts <- c(0.8,0.15,0.05)

sets <- vector(mode = "list", length = length(parts))

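for (i in seq_along(parts))
{
  # number of rows for this set, as given by parts[i];
  # floor() makes the truncation of fractional sizes explicit
  size <- floor(parts[i] * totrows)
  # sample by position so that a single remaining row number x
  # is not treated by sample() as the range 1:x
  sets[[i]] <- rownos[sample.int(length(rownos), size)]
  # removing used row nos
  rownos <- setdiff(rownos, sets[[i]])
}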

If you want overlapping sets, you can remove the setdiff statement in the loop.
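
The loop only collects row numbers, so one more step is needed to get actual data frames. Below is a minimal, self-contained sketch of that step. It assumes mtcars as a stand-in for dat and a fixed seed purely for reproducibility; the name frames is made up for illustration. Because the fractional sizes are truncated, a few rows can end up in no set at all.

```r
set.seed(42)                   # assumption: fixed seed, only for reproducibility
dat <- mtcars                  # assumption: stand-in for your own data frame

totrows <- nrow(dat)
rownos <- seq(totrows)
parts <- c(0.8, 0.15, 0.05)
sets <- vector(mode = "list", length = length(parts))

for (i in seq_along(parts)) {
  size <- floor(parts[i] * totrows)
  # sample positions within the remaining row numbers
  sets[[i]] <- rownos[sample.int(length(rownos), size)]
  # drop the rows already used so the sets cannot overlap
  rownos <- setdiff(rownos, sets[[i]])
}

# Turn each set of row numbers into its own data frame
frames <- lapply(sets, function(rows) dat[rows, , drop = FALSE])

sapply(frames, nrow)           # rows per data frame
anyDuplicated(unlist(sets))    # 0 means no row was used twice
length(rownos)                 # rows left unassigned after truncation
```

If every row has to land somewhere, one option is to append the leftover rownos to one of the sets before building frames.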