按组选择随机样本,R中包含附加条件

时间:2016-12-19 13:03:37

标签: r random

基于this post,我试图制作行样本。使用相同的R iris数据示例。我已经为每个物种正确创建了15行的样本

Selec_ir<-iris[ with(iris, unlist(tapply(seq_len(nrow(iris)),
                          Species, FUN = sample, 15,replace=FALSE))), ]

但是现在如何根据新选择的行必须至少在最后一个选定行的20行之后的条件创建样本?

1 个答案:

答案 0 :(得分:0)

以下函数将用于传递数据集中每个组的所有row_numbers,然后无需替换即可绘制sample,然后使用组合删除所有属于步长的值splitfindInterval。使用所需的样本步骤,返回的数组将用于slice所需的样本大小。

根据需要修改sample_sizesample_step,以调整初始样本数量和保留样本之间的行数

library(plyr)

sample_drop <- function(x, sample_size, sample_step=1) {

  # draw sample and convert to list
  lst_samp <- list(sort(sample(x, size=sample_size, replace=FALSE)))

  # function to split last element of list by step size
  split_last <- function(lst, step) {
    lst_tail <- unlist(tail(lst, n=1L))
    split(lst_tail, findInterval(lst_tail, c(0, step) + min(lst_tail)))
  }

  # split list until all values of last element fall within step size
  while(do.call(function(x) max(x) - min(x), list(unlist(tail(lst_samp, n=1L)))) >= sample_step) {
    lst_samp <- c(head(lst_samp, n=-1L), split_last(lst_samp, sample_step))
  }

  #lst_samp <- llply(lst_samp, unname) # for debug only to remove attr names
  laply(lst_samp, min) # return minimum value from each element

}

以下是应用于iris数据集的函数。

library(dplyr)

data("iris")

sample <- list()
sample$seed <- 1
sample$size <- 15L
sample$step <- 20L

# simulate sample draws with dropping and compare to iris results
set.seed(sample$seed)
sample_drop(50, sample$size, sample$step)
sample_drop(50, sample$size, sample$step)
sample_drop(50, sample$size, sample$step)

set.seed(sample$seed)
iris %>%
  group_by(Species) %>%
  mutate(gid=row_number()) %>%
  slice(sample_drop(n(), sample$size, sample$step))

以下是应用于较大diamonds数据集

的函数
library(dplyr)
library(ggplot2)

data("diamonds")

sample <- list()
sample$seed <- 1
sample$size <- 1000L
sample$step <- 20L

set.seed(sample$seed)
diamonds %>%
  group_by(cut) %>%
  mutate(gid=row_number()) %>%
  slice(sample_drop(n(), sample$size, sample$step))

set.seed(sample$seed)
diamonds %>%
  group_by(cut) %>%
  mutate(gid=row_number()) %>%
  slice(sample_drop(n(), sample$size, sample$step)) %>%
  summarise(samples=n())

可能还有改进的余地,但这对我来说更容易理解