Question

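Apologies if this is a silly or obvious question; I'm relatively new to R! I'm interested in creating a new dataset made up of groups of rows resampled, with replacement, from a larger dataset.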

My dataset looks like this, with multiple rows for each value of the grouping variable:

> df <- data.frame(value=c(1:5,1:4,1:3),ID=c(rep(1,5),rep(2,4),rep(3,3)))
> df
   value ID
1      1  1
2      2  1
3      3  1
4      4  1
5      5  1
6      1  2
7      2  2
8      3  2
9      4  2
10     1  3
11     2  3
12     3  3

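What I'd like to do is create a new dataset that is resampled (with replacement) based on the grouping variable. So the resampled dataset might look like this: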

   value ID
1      1  1
2      2  1
3      3  1
4      4  1
5      5  1
6      1  3
7      2  3
8      3  3
9      1  1
10     2  1
11     3  1
12     4  1
13     5  1

Thanks for any suggestions!

Answer 1

To sample a different number of rows for each ID value, you could try something like this (assuming ID has only a few unique values):

result <- NULL
# draw 10, 5 and 3 rows (with replacement) from ID 1, 2 and 3 respectively
result <- rbind(result, df[sample(row.names(df[df$ID == 1, ]), 10, replace = TRUE), ])
result <- rbind(result, df[sample(row.names(df[df$ID == 2, ]), 5, replace = TRUE), ])
result <- rbind(result, df[sample(row.names(df[df$ID == 3, ]), 3, replace = TRUE), ])
# renumber the rows 1..n
row.names(result) <- seq_len(nrow(result))

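If there are many ID values, you may want to use a loop together with a vector holding the number of samples you want for each ID value. For example, with six ID values and corresponding sample sizes of 10, 5, 3, 7, 8 and 2, you could do the following (this assumes the IDs are the integers 1 to 6, so that the loop index doubles as the ID):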

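nsamples <- c(10, 5, 3, 7, 8, 2)  # nsamples[i] rows are drawn for ID == i
result <- NULL
for (i in seq_along(nsamples)) {
  # sample row names within ID i, with replacement, and append those rows
  result <- rbind(result, df[sample(row.names(df[df$ID == i, ]), nsamples[i], replace = TRUE), ])
}
row.names(result) <- seq_len(nrow(result))  # renumber the rows 1..n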

Either way, you end up with output like the following (shown here for the three-ID example):

   value ID
1      1  1
2      4  1
3      1  1
4      4  1
5      2  1
6      3  1
7      1  1
8      1  1
9      4  1
10     2  1
11     2  2
12     3  2
13     1  2
14     3  2
15     1  2
16     3  3
17     2  3
18     1  3
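
As a variation on the loop, here is a minimal sketch that avoids growing result inside the loop and does not require the IDs to be 1, 2, 3, ... It assumes the sample sizes are kept in a named vector whose names are the ID values; that named vector and the helper names pieces, rows and picked are illustrative choices, not part of the original answer.

```r
# Example data from the question
df <- data.frame(value = c(1:5, 1:4, 1:3),
                 ID = c(rep(1, 5), rep(2, 4), rep(3, 3)))

# Assumption: names are the ID values, entries are the number of rows to draw
nsamples <- c("1" = 10, "2" = 5, "3" = 3)

set.seed(42)  # only to make the draw reproducible
pieces <- lapply(names(nsamples), function(id) {
  rows <- which(as.character(df$ID) == id)  # row positions belonging to this ID
  # sample positions within rows, so a one-row group is handled correctly
  picked <- rows[sample.int(length(rows), nsamples[[id]], replace = TRUE)]
  df[picked, ]
})

# Bind all pieces in one call and reset the row names to 1..n
result <- do.call(rbind, pieces)
row.names(result) <- NULL
```

Collecting the pieces in a list and binding once tends to scale better than repeated rbind calls when there are many IDs.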

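Using the dplyr solution mentioned above, you can also do something similar with a variable number of samples per ID value (this too requires specifying the number of samples for each ID in a vector beforehand):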

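library(dplyr)
nsamples <- c(10, 5, 3)  # nsamples[k] rows are drawn for ID == k
# ID[1] picks out the current group's ID so the size passed to sample() is a single number
df %>% group_by(ID) %>% slice(sample(n(), nsamples[ID[1]], replace = TRUE))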
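
The approaches above resample rows within each ID. The example output in the question instead looks like whole ID groups being drawn with replacement (IDs 1, 3, 1). If that is the goal, a minimal base R sketch could look like this; the names ids, drawn and resampled are illustrative, and drawing as many groups as there are unique IDs is an assumption.

```r
# Example data from the question
df <- data.frame(value = c(1:5, 1:4, 1:3),
                 ID = c(rep(1, 5), rep(2, 4), rep(3, 3)))

set.seed(1)  # only to make the draw reproducible
ids <- unique(df$ID)

# Assumption: draw as many IDs as there are groups, with replacement
drawn <- ids[sample.int(length(ids), length(ids), replace = TRUE)]

# For each drawn ID take all of its rows; an ID drawn twice contributes its rows twice
resampled <- do.call(rbind, lapply(drawn, function(id) df[df$ID == id, ]))
row.names(resampled) <- NULL  # reset the row names to 1..n
```

Note that repeated draws of the same ID are indistinguishable in resampled; if they need to be told apart, a draw index column could be added inside the lapply call.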

Resampling groups of rows based on a grouping variable in R
