基于this post,我试图制作行样本。使用相同的R iris数据示例。我已经为每个物种正确创建了15行的样本
Selec_ir<-iris[ with(iris, unlist(tapply(seq_len(nrow(iris)),
Species, FUN = sample, 15,replace=FALSE))), ]
但是现在如何根据新选择的行必须至少在最后一个选定行的20行之后的条件创建样本?
答案 0 :(得分:0)
以下函数将用于传递数据集中每个组的所有row_numbers
,然后无需替换即可绘制sample
,然后使用组合删除所有属于步长的值split
和findInterval
。使用所需的样本步骤,返回的数组将用于slice
所需的样本大小。
根据需要修改sample_size
和sample_step
,以调整初始样本数量和保留样本之间的行数
library(plyr)
sample_drop <- function(x, sample_size, sample_step=1) {
# draw sample and convert to list
lst_samp <- list(sort(sample(x, size=sample_size, replace=FALSE)))
# function to split last element of list by step size
split_last <- function(lst, step) {
lst_tail <- unlist(tail(lst, n=1L))
split(lst_tail, findInterval(lst_tail, c(0, step) + min(lst_tail)))
}
# split list until all values of last element fall within step size
while(do.call(function(x) max(x) - min(x), list(unlist(tail(lst_samp, n=1L)))) >= sample_step) {
lst_samp <- c(head(lst_samp, n=-1L), split_last(lst_samp, sample_step))
}
#lst_samp <- llply(lst_samp, unname) # for debug only to remove attr names
laply(lst_samp, min) # return minimum value from each element
}
以下是应用于iris
数据集的函数。
library(dplyr)
data("iris")
sample <- list()
sample$seed <- 1
sample$size <- 15L
sample$step <- 20L
# simulate sample draws with dropping and compare to iris results
set.seed(sample$seed)
sample_drop(50, sample$size, sample$step)
sample_drop(50, sample$size, sample$step)
sample_drop(50, sample$size, sample$step)
set.seed(sample$seed)
iris %>%
group_by(Species) %>%
mutate(gid=row_number()) %>%
slice(sample_drop(n(), sample$size, sample$step))
以下是应用于较大diamonds
数据集
library(dplyr)
library(ggplot2)
data("diamonds")
sample <- list()
sample$seed <- 1
sample$size <- 1000L
sample$step <- 20L
set.seed(sample$seed)
diamonds %>%
group_by(cut) %>%
mutate(gid=row_number()) %>%
slice(sample_drop(n(), sample$size, sample$step))
set.seed(sample$seed)
diamonds %>%
group_by(cut) %>%
mutate(gid=row_number()) %>%
slice(sample_drop(n(), sample$size, sample$step)) %>%
summarise(samples=n())
可能还有改进的余地,但这对我来说更容易理解