Question

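**Edit, because I'm a doofus: WITH replacement, not without.**

I have a large (> 500k rows) dataset containing 421 groups, defined by two grouping variables. Sample data is as follows: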

df<-data.frame(group_one=rep((0:9),26), group_two=rep((letters),10))

head(df)

  group_one group_two
1         0         a
2         1         b
3         2         c
4         3         d
5         4         e
6         5         f

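...and so on.

What I want is some number of stratified samples (k = 12 at the moment, but that number may vary), stratified by membership in (group_one x group_two). Membership in each sample should be indicated by a new column, sample_membership, with a value from 1 to k (12 at the moment). I should be able to subset by sample_membership and get up to 12 distinct samples, each of which is representative with respect to group_one and group_two.

The final dataset would look something like this: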

  group_one group_two sample_membership
1         0         a                 1  
2         0         a                12
3         0         a                 5
4         1         a                 5
5         1         a                 7
6         1         a                 9

Thoughts? Thanks very much in advance!

Answer 1

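Maybe something like this?

library(dplyr)

df %>%
  group_by(group_one, group_two) %>%
  # replace = TRUE, as the question asks; with replace = FALSE,
  # sample() errors for any group that has more than 12 rows
  mutate(sample_membership = sample(1:12, n(), replace = TRUE))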
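
Once a sample_membership column exists, the subsetting described in the question might look like the sketch below. It is base R only, so that it runs on its own; the data (4 groups of 3000 rows each) are made up for illustration, and the `ave()` call is a stand-in for the assignment step, not the code from this answer.

```r
# Assumption: made-up data with 4 equally sized groups; ave() stands in
# for the grouped assignment step.
set.seed(1)
k <- 12L
df <- data.frame(group_one = rep(0:1, each = 6000),
                 group_two = rep(rep(c("a", "b"), each = 3000), 2))
df$sample_membership <- ave(
  seq_len(nrow(df)), df$group_one, df$group_two,
  FUN = function(i) sample.int(k, length(i), replace = TRUE)
)

# One data frame per sample, named "1" to "12".
samples <- split(df, df$sample_membership)

# Share of each (group_one, group_two) cell inside sample 1; with groups
# this large, each share should sit close to the overall 0.25.
prop.table(table(samples[["1"]]$group_one, samples[["1"]]$group_two))
```

With replacement, the group shares inside each sample are only approximately equal to those of the full data, and the approximation gets worse as groups get smaller.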

Answer 2

Here is a one-liner data.table approach, which you should definitely consider if you have a long data.frame.

library(data.table)

setDT(df)

df[, sample_membership := sample.int(12, .N, replace=TRUE), keyby = .(group_one, group_two)]

df
#    group_one group_two sample_membership
#   1:         0         a                 9
#   2:         0         a                 8
#   3:         0         c                10
#   4:         0         c                 4
#   5:         0         e                 9
# ---                                      
# 256:         9         v                 4
# 257:         9         x                 7
# 258:         9         x                11
# 259:         9         z                 3
# 260:         9         z                 8

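For sampling without replacement, use replace=FALSE, but as noted elsewhere, make sure each group has at most k members. OR

If you want "sampling without unnecessary replacement" (making this term up, I'm not sure what the right terminology is), because you have more than k members per group but still want to keep the samples as evenly sized as possible, you could do something like this: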

# example with bigger groups
k <- 12L
big_df <- data.frame(group_one=rep((0:9),260), group_two=rep((letters),100))
setDT(big_df)

big_df[, sample_round := rep(1:.N, each=k, length.out=.N), keyby = .(group_one, group_two)]
big_df[, sample_membership := sample.int(k, .N, replace=FALSE), keyby = .(group_one, group_two, sample_round)]

head(big_df, 15) # you can see first repeat does not occur until row k+1

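Within each "sampling round" (the first k observations in the group, the second k observations in the group, and so on), sampling is done without replacement. Then, if necessary, the next sampling round makes all k assignments available again.

This approach stratifies the samples really evenly (but perfectly evenly only if each group has a multiple of k members).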
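
The evenness claim can be checked with a small sketch. The version below uses base R only, so that it runs without data.table; `balanced_labels()` is a hypothetical helper name, and the `ave()` call is a stand-in for the two grouped assignments above, not the answer's own code.

```r
# Assumption: base R only; balanced_labels() is a hypothetical helper.
set.seed(42)
k <- 12L

# Same shape as big_df above: 130 groups of 20 rows each.
big_df <- data.frame(group_one = rep(0:9, 260),
                     group_two = rep(letters, 100))

# Label n rows with values in 1..k, sampling without replacement inside
# each consecutive block ("round") of at most k rows.
balanced_labels <- function(n, k) {
  rounds <- split(seq_len(n), ceiling(seq_len(n) / k))
  as.integer(unlist(lapply(rounds, function(idx) sample.int(k, length(idx))),
                    use.names = FALSE))
}

# ave() applies the helper within each (group_one, group_two) cell.
big_df$sample_membership <- ave(
  seq_len(nrow(big_df)), big_df$group_one, big_df$group_two,
  FUN = function(i) balanced_labels(length(i), k)
)

# Rows per label within each group.
counts <- table(interaction(big_df$group_one, big_df$group_two, drop = TRUE),
                factor(big_df$sample_membership, levels = seq_len(k)))
range(counts)
```

Each group here has 20 rows, so every label appears once in the first round and 8 of the 12 labels appear again in the second. Every count is therefore 1 or 2.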

Answer 3

Here is a base R method that assumes your data.frame is sorted by group:

# get number of observations for each group
groupCnt <- with(df, aggregate(group_one, list(group_one, group_two), FUN=length))$x

# for reproducibility, set the seed
set.seed(1234)    
# get sample by group; lapply + unlist always yields a flat vector,
# even when the groups differ in size
df$sample <- unlist(lapply(groupCnt, function(i) sample(12, i, replace=TRUE)))
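
If sorting the data first is inconvenient, one possible alternative is `ave()`, which writes each group's result back to that group's original row positions. This is a minimal sketch on the question's sample data, not part of the answer above; the column name sample_membership follows the question.

```r
# Assumption: base R only; the rows do not need to be sorted by group.
set.seed(1234)
k <- 12L
df <- data.frame(group_one = rep(0:9, 26), group_two = rep(letters, 10))

# ave() splits the row indices by group, runs FUN on each piece, and
# puts the results back in the original row order.
df$sample_membership <- ave(
  seq_len(nrow(df)), df$group_one, df$group_two,
  FUN = function(i) sample.int(k, length(i), replace = TRUE)
)

head(df)
```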

Answer 4

An untested example using dplyr; if it doesn't work, it may at least point you in the right direction.

library(dplyr)
set.seed(123)
df <- data.frame(
  group_one = as.integer(runif(1000, 1, 6)),
  group_two = sample(LETTERS[1:6], 1000, TRUE)
) %>%
  group_by(group_one, group_two) %>%
  mutate(
    # random permutation of 1..n within each group (n = group size), not 1..12
    sample_membership = sample(seq(1, length(group_one)), length(group_one), FALSE)
  )

Good luck!

Generating random numbers by group with replacement
