Sample random rows in a data frame with given probabilities

Date: 2017-04-25 10:43:33

Tags: r random dataframe

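How can I draw a random sample (with or without replacement) with given probabilities?

I am trying to extract a random sample of rows from the iris data frame, subject to the following condition on species: 80% versicolor and 20% virginica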

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa 

3 answers:

Answer 0: (score: 3)

You can try this in base R:

f.sample <- function(a, percent) a[sample(nrow(a), nrow(a)*percent, replace = TRUE),]

f.sample(iris[iris$Species=="versicolor",], 0.8)
f.sample(iris[iris$Species=="virginica",], 0.2)

You can set the replace argument to suit your needs.
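To see what the two calls yield together, the results can be combined and the species shares checked. This is only a sketch that reuses the f.sample definition above; the name `res` and the fixed seed are assumptions made for illustration.

```r
# f.sample as defined in the answer: draw a fraction of the rows of `a`
f.sample <- function(a, percent) a[sample(nrow(a), nrow(a)*percent, replace = TRUE),]

set.seed(1)  # assumption: fixed seed, only to make the sketch reproducible

# stack the versicolor and virginica samples into one data frame
res <- rbind(
  f.sample(iris[iris$Species == "versicolor", ], 0.8),
  f.sample(iris[iris$Species == "virginica", ], 0.2)
)

# share of each species among the sampled rows
prop.table(table(droplevels(res$Species)))
```

Because iris holds 50 rows per species, this gives 40 versicolor and 10 virginica rows. With groups of unequal size, the combined shares would no longer be 80/20.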

Answer 1: (score: 3)

We can use quosures from the devel version of dplyr (soon to be released as 0.6.0) to create a function

library(tidyverse)
f.sample <- function(dat, colN, value, perc){
       colN <- enquo(colN)
       value <- quo_name(enquo(value))
       dat %>%
            filter(UQ(colN) == UQ(value)) %>%
            sample_frac(perc) %>%
            droplevels
}

f.sample(iris, Species, versicolor, 0.8)
f.sample(iris, Species, virginica, 0.2)
#Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#20          6.0         2.2          5.0         1.5 virginica
#9           6.7         2.5          5.8         1.8 virginica
#15          5.8         2.8          5.1         2.4 virginica
#10          7.2         3.6          6.1         2.5 virginica
#12          6.4         2.7          5.3         1.9 virginica
#49          6.2         3.4          5.4         2.3 virginica
#22          5.6         2.8          4.9         2.0 virginica
#34          6.3         2.8          5.1         1.5 virginica
#2           5.8         2.7          5.1         1.9 virginica
#44          6.8         3.2          5.9         2.3 virginica

enquo works much like substitute: it takes the input argument and converts it to a quosure. quo_name converts the quosure to a string, and inside filter/group_by/summarise/mutate the quosure is evaluated by unquoting it (UQ or !!).

Based on the comments below, we modified the function so that it also works for other cases

f.sample2 <- function(dat, colN, values, perc){
       colN <- enquo(colN)
       dat %>%
            filter(UQ(colN) %in% values) %>%
            droplevels %>%
            nest(-UQ(colN)) %>%
            .$data %>%
            setNames(values) %>%
            Map(sample_frac, ., perc) %>%
            bind_rows(.id = quo_name(colN))
}

res <- f.sample2(iris, Species, c("versicolor", "virginica"), c(0.8, 0.2))
prop.table(table(res$Species))
#versicolor  virginica
#       0.8        0.2
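Since enquo is described as working like substitute, the same capture-and-convert idea can be sketched in base R. This is not the answer's dplyr code: the function name `f.sample.base` is hypothetical, and deparse(substitute(...)) stands in for quo_name(enquo(...)) purely for illustration.

```r
# Base R analogue (an assumption, not the answer's code): capture unquoted
# arguments with substitute() and turn them into strings with deparse()
f.sample.base <- function(dat, colN, value, perc) {
  col <- deparse(substitute(colN))    # e.g. "Species"
  val <- deparse(substitute(value))   # e.g. "virginica"
  # keep only the rows whose column `col` equals `val`
  sub <- dat[dat[[col]] == val, , drop = FALSE]
  # sample a fraction of the matching rows without replacement
  idx <- sample(seq_len(nrow(sub)), round(nrow(sub) * perc))
  droplevels(sub[idx, , drop = FALSE])
}

set.seed(1)  # assumption: fixed seed for reproducibility
out <- f.sample.base(iris, Species, virginica, 0.2)
nrow(out)
```

The call passes Species and virginica unquoted, as in the dplyr version. The price of this convenience is that the function cannot easily be called with column names stored in variables.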

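Answer 2: (score: 3)

I seem to have understood the question differently from the other answerers.

The following function should generate an 80/20 dataset regardless of the group sizes in the original dataset.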

foo <- function(DF, n = 50, group_var, groups, probs, replace = FALSE) {

  # subset relevant groups & split
  DF <- DF[DF[[group_var]] %in% groups, ]
  DF <- split(DF, as.character(DF[[group_var]]))
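  DF <- DF[groups]   # order the list to match `groups` and `probs`

  # sample number of observations per group (this requires replace = TRUE)
  smpl <- sample(groups, size = n, replace = TRUE, prob = probs)
  # count draws per group in the order of `groups`, keeping zero counts
  counts <- c(table(factor(smpl, levels = groups)))
  # subset random rows per group according to the sampled group size
  DF <- Map(function(x, y) x[sample(seq_len(nrow(x)), y, replace = replace), ], DF, counts)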

  # combine and clean up
  DF <- do.call(rbind, DF)
  DF <- DF[sample(nrow(DF)),]  # not really necessary  
  row.names(DF) <- NULL        # not really necessary  
  DF
}


foo(iris, 50, "Species", c("versicolor", "virginica"), c(0.8, 0.2))
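A related way to get group shares that do not depend on the original group sizes is to pass per-row weights to sample through its prob argument. The sketch below is an assumption-laden alternative, not part of the answer: the helper name `sample_rows_weighted` is hypothetical, and it defaults to sampling with replacement, because without replacement the weights are applied sequentially and the shares are only approximate.

```r
# Hypothetical helper: draw n rows so that each draw lands in a group
# with the given probability, using per-row weights in sample()
sample_rows_weighted <- function(DF, n, group_var, groups, probs, replace = TRUE) {
  DF <- DF[DF[[group_var]] %in% groups, , drop = FALSE]
  grp <- as.character(DF[[group_var]])
  # weight of a row = probability of its group / number of rows in that group
  w <- probs[match(grp, groups)] / c(table(grp))[grp]
  DF[sample(nrow(DF), n, replace = replace, prob = w), , drop = FALSE]
}

set.seed(42)  # assumption: fixed seed for reproducibility
res <- sample_rows_weighted(iris, 1000, "Species",
                            c("versicolor", "virginica"), c(0.8, 0.2))
# shares are random, so they are close to 0.8 / 0.2 rather than exact
prop.table(table(droplevels(res$Species)))
```

As with foo, the realized shares vary from run to run, and the 80/20 split holds in expectation.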