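How do I draw a random sample (with or without replacement) with given probabilities?
I am trying to draw a random sample of rows from the iris data frame, subject to the following condition on Species:
80% versicolor and 20% virginica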
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Answer 0 (score: 3)
You can try this in base R:
f.sample <- function(a, percent) a[sample(nrow(a), nrow(a)*percent, replace = TRUE),]
f.sample(iris[iris$Species=="versicolor",], 0.8)
f.sample(iris[iris$Species=="virginica",], 0.2)
You can set the replace argument as needed (TRUE samples with replacement, FALSE without).
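The two calls above return separate data frames. As a minimal sketch of how they could be combined and checked, the block below uses a hypothetical helper sample_rows (not part of the original answer) that exposes replace as an argument and rounds the row count. The seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical variant of f.sample with an explicit `replace` argument
sample_rows <- function(a, percent, replace = FALSE) {
  a[sample(nrow(a), round(nrow(a) * percent), replace = replace), ]
}

# Draw 80% of the versicolor rows and 20% of the virginica rows, then stack them
res <- rbind(
  sample_rows(iris[iris$Species == "versicolor", ], 0.8),
  sample_rows(iris[iris$Species == "virginica", ], 0.2)
)

# Each species has 50 rows in iris, so this yields 40 and 10 rows
counts <- table(droplevels(res$Species))
prop.table(counts)
```

The 80/20 split holds here only because both species have the same number of rows in iris. With unequal group sizes, taking a fraction of each group would not give those proportions.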
Answer 1 (score: 3)
We can use quosures from the devel version of dplyr (soon to be released as 0.6.0) to create a function:
library(tidyverse)
f.sample <- function(dat, colN, value, perc){
colN <- enquo(colN)
value <- quo_name(enquo(value))
dat %>%
filter(UQ(colN) == UQ(value)) %>%
sample_frac(perc) %>%
droplevels
}
f.sample(iris, Species, versicolor, 0.8)
f.sample(iris, Species, virginica, 0.2)
#Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#20 6.0 2.2 5.0 1.5 virginica
#9 6.7 2.5 5.8 1.8 virginica
#15 5.8 2.8 5.1 2.4 virginica
#10 7.2 3.6 6.1 2.5 virginica
#12 6.4 2.7 5.3 1.9 virginica
#49 6.2 3.4 5.4 2.3 virginica
#22 5.6 2.8 4.9 2.0 virginica
#34 6.3 2.8 5.1 1.5 virginica
#2 5.8 2.7 5.1 1.9 virginica
#44 6.8 3.2 5.9 2.3 virginica
enquo does a job similar to substitute: it takes the input argument and converts it to a quosure. quo_name converts the quosure to a string, and inside filter/group_by/summarise/mutate the quosure is evaluated by unquoting it (UQ or !!).
Based on the comments below, we modified the function so that it also works for other cases:
f.sample2 <- function(dat, colN, values, perc){
colN <- enquo(colN)
dat %>%
filter(UQ(colN) %in% values) %>%
droplevels %>%
nest(-UQ(colN)) %>%
.$data %>%
setNames(values) %>%
Map(sample_frac, ., perc) %>%
bind_rows(.id = quo_name(colN))
}
res <- f.sample2(iris, Species, c("versicolor", "virginica"), c(0.8, 0.2))
prop.table(table(res$Species))
#versicolor virginica
# 0.8 0.2
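To make the quote-then-unquote step concrete, here is a minimal sketch using the `!!` form instead of UQ. The function name count_where is hypothetical and only for illustration. It assumes a dplyr version that supports tidy evaluation (0.7.0 or later).

```r
library(dplyr)

# Hypothetical helper: count the rows where a bare column name equals a value
count_where <- function(dat, col, value) {
  col <- enquo(col)              # capture the bare column name as a quosure
  dat %>%
    filter(!!col == value) %>%   # unquote it inside filter()
    nrow()
}

n_setosa <- count_where(iris, Species, "setosa")
n_setosa  # iris has 50 rows per species
```

Unlike f.sample, the value is passed as an ordinary string here, so only the column name needs quoting.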
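Answer 2 (score: 3)
I seem to have understood the question differently from the other answerers.
The following function should produce an 80/20 data set regardless of the group sizes in the original data set.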
foo <- function(DF, n = 50, group_var, groups, probs, replace = FALSE) {
# subset relevant groups & split
DF <- DF[DF[[group_var]] %in% groups, ]
DF <- split(DF, as.character(DF[[group_var]]))
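# put the list in the same order as groups (and therefore probs)
DF <- DF[match(groups, names(DF))]
# sample number of observations per group (this requires replace= TRUE)
smpl <- sample(groups, size = n, replace = TRUE, prob = probs)
# count draws per group in the order of groups; factor() keeps groups with zero draws
cnt <- c(table(factor(smpl, levels = groups)))
# subset random rows per group according to the sampled counts
DF <- Map(function(x,y) x[sample(1:nrow(x), y, replace = replace),], DF, cnt)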
# combine and clean up
DF <- do.call(rbind, DF)
DF <- DF[sample(nrow(DF)),] # not really necessary
row.names(DF) <- NULL # not really necessary
DF
}
foo(iris, 50, "Species", c("versicolor", "virginica"), c(0.8, 0.2))
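Because the group labels are themselves drawn at random, the counts per group vary from call to call and match 80/20 only on average. The sketch below contrasts that with fixing the counts at round(n * probs), which is an assumption of this example and not part of the answer. The names random_counts and fixed_counts are hypothetical, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 50
groups <- c("versicolor", "virginica")
probs <- c(0.8, 0.2)

# Random group sizes, as in foo(): the counts differ between runs but sum to n
smpl <- sample(groups, size = n, replace = TRUE, prob = probs)
random_counts <- table(factor(smpl, levels = groups))

# Fixed group sizes (alternative assumed for this sketch): exactly 40 and 10
fixed_counts <- setNames(round(n * probs), groups)

# Draw the requested number of rows from each group without replacement
pieces <- lapply(groups, function(g) {
  rows <- which(iris$Species == g)
  iris[rows[sample.int(length(rows), fixed_counts[[g]])], ]
})
res <- do.call(rbind, pieces)
table(droplevels(res$Species))
```

Which variant fits depends on whether the 80/20 split should hold exactly in every sample or only in expectation.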