Randomly sampling rows within groups that have different nrow

Date: 2017-12-05 02:35:36

Tags: r dplyr

How can I draw n rows from each group when the groups have different numbers of rows?

df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=10)

I have tried:

library(dplyr)
outdat <- df %>% 
  group_by(color) %>% 
  sample_n(nrow(.), replace = TRUE)
outdat

But this returns a data.frame in which nrow(.) is the nrow of df (the whole data frame), not of each group's subset.

This SO post is close, but it fixes a specific number of rows to draw. I need this to work per group in dplyr.
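The difference between nrow(.) and the per-group row count can be checked directly. The sketch below is only an illustration: df_demo, its column x, and the unequal group sizes are hypothetical and not part of the original question. It compares nrow(.) with dplyr's n() inside summarise().

```r
library(dplyr)

# Hypothetical data with unequal group sizes (5, 10, 15 and 10 rows)
set.seed(1)
df_demo <- data.frame(x = rnorm(40))
df_demo$color <- rep(c("blue", "red", "yellow", "pink"),
                     times = c(5, 10, 15, 10))

# `.` is the whole data frame passed through the pipe,
# while n() is evaluated once per group
chk <- df_demo %>%
  group_by(color) %>%
  summarise(rows_dot = nrow(.), rows_group = n())
chk
```

Under these assumptions, rows_dot is the same for every group, while rows_group follows each group's own size.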

3 Answers:

Answer 0: (score: 4)

Another approach is to use sample_frac:

outdat <- df %>%
    group_by(color) %>%
    sample_frac(1, replace = TRUE)
outdat
# # A tibble: 40 x 3
# # Groups:   color [4]
#             X1          X2 color
#          <dbl>       <dbl> <chr>
#  1  0.69256186  0.97180252  blue
#  2  1.54384827 -0.20268802  blue
#  3 -1.20068240 -0.45402013  blue
#  4  2.63407877 -0.31644247  blue
#  5  1.20716737 -0.91380874  blue
#  6  0.01067475  1.02004679  blue
#  7  0.01067475  1.02004679  blue
#  8  1.79732108 -0.04072946  blue
#  9  0.01067475  1.02004679  blue
# 10  1.79732108 -0.04072946  blue
# # ... with 30 more rows

Also, use outdat %>% ungroup() to remove the grouping.
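In dplyr 1.0.0 and later, sample_frac() is marked as superseded by slice_sample(). The following is a minimal sketch of the same idea with slice_sample(prop = 1, replace = TRUE), assuming a recent dplyr version. The data frame df_demo and its unequal group sizes are hypothetical, chosen so that the per-group behaviour is visible.

```r
library(dplyr)

# Hypothetical data with unequal group sizes
set.seed(42)
df_demo <- data.frame(x = rnorm(40))
df_demo$color <- rep(c("blue", "red", "yellow", "pink"),
                     times = c(5, 10, 15, 10))

# prop = 1 draws as many rows as each group contains, with replacement
boot <- df_demo %>%
  group_by(color) %>%
  slice_sample(prop = 1, replace = TRUE) %>%
  ungroup()

# Compare group sizes before and after resampling
size_before <- count(df_demo, color)
size_after <- count(boot, color)
size_before
size_after
```

Both counts should list the same size for every color, since the proportion is applied within each group rather than to the whole data frame.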

Answer 1: (score: 3)

Another solution, using slice and sample.int. Reusing the data from www:
outdat <- df %>%
  group_by(color) %>%
  slice(sample.int(n(), replace = TRUE))
outdat

            X1          X2  color
1   1.71506499 -1.12310858   blue
2   0.07050839  2.16895597   blue
3   0.46091621 -0.40288484   blue
4   0.07050839  2.16895597   blue
5   0.07050839  2.16895597   blue
6   1.71506499 -1.12310858   blue
7  -1.26506123 -0.46665535   blue
8   1.55870831 -1.26539635   blue
9   0.12928774  1.20796200   blue
10  1.55870831 -1.26539635   blue
11  0.55391765 -0.28477301   pink
12 -0.29507148 -2.30916888   pink
13 -0.30596266  0.18130348   pink
14 -0.06191171 -1.22071771   pink
15  0.55391765 -0.28477301   pink
16  0.55391765 -0.28477301   pink
17  0.87813349 -0.70920076   pink
18  0.68864025  1.02557137   pink
19 -0.30596266  0.18130348   pink
20  0.68864025  1.02557137   pink
21  0.70135590  0.12385424    red
22  0.11068272  1.36860228    red
23 -1.96661716  0.58461375    red
24  0.40077145 -0.04287046    red
25  1.78691314  1.51647060    red
26 -0.55584113 -0.22577099    red
27  0.40077145 -0.04287046    red
28  1.78691314  1.51647060    red
29 -0.47279141  0.21594157    red
30 -0.47279141  0.21594157    red
31 -1.02600445 -0.33320738 yellow
32 -0.72889123 -1.01857538 yellow
33  1.25381492  2.05008469 yellow
34  0.83778704  0.44820978 yellow
35  1.25381492  2.05008469 yellow
36 -0.62503927 -1.07179123 yellow
37 -0.62503927 -1.07179123 yellow
38  0.83778704  0.44820978 yellow
39 -0.21797491 -0.50232345 yellow
40 -1.68669331  0.30352864 yellow
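Inside slice(), n() is the size of the current group, so sample.int(n(), replace = TRUE) yields row positions within that group only. A small sketch of how one might confirm that no row crosses group boundaries: the id column and df_demo are hypothetical additions for this check, not part of the answer above.

```r
library(dplyr)

# Hypothetical data: id records each row's original position
set.seed(7)
df_demo <- data.frame(id = 1:40, x = rnorm(40))
df_demo$color <- rep(c("blue", "red", "yellow", "pink"),
                     times = c(5, 10, 15, 10))

resampled <- df_demo %>%
  group_by(color) %>%
  slice(sample.int(n(), replace = TRUE)) %>%
  ungroup()

# Look up every sampled id in the original data;
# its original color should match the group it was drawn in
same_group <- all(df_demo$color[resampled$id] == resampled$color)
same_group
```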

Answer 2: (score: 2)

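A workaround using the purrr package. It seems the sample_n function cannot take n() as the size argument, possibly because that argument does not accept vectorized input. However, if we split the data frame by color, we can apply sample_n with size = nrow(.) to each piece, where . is now a single group's data frame.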

# Set seed for reproducibility
set.seed(123)

# Create example data frame
df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=10)

# Load packages
library(dplyr)
library(purrr)

outdat <- df %>%
  # Split the data frame by color
  split(.$color) %>%
  # Apply the sample_n function to all data frames
  map_dfr(~sample_n(., size = nrow(.), replace = TRUE))

outdat
#             X1          X2  color
# 1   1.71506499 -1.12310858   blue
# 2   0.07050839  2.16895597   blue
# 3   0.46091621 -0.40288484   blue
# 4   0.07050839  2.16895597   blue
# 5   0.07050839  2.16895597   blue
# 6   1.71506499 -1.12310858   blue
# 7  -1.26506123 -0.46665535   blue
# 8   1.55870831 -1.26539635   blue
# 9   0.12928774  1.20796200   blue
# 10  1.55870831 -1.26539635   blue
# 11  0.55391765 -0.28477301   pink
# 12 -0.29507148 -2.30916888   pink
# 13 -0.30596266  0.18130348   pink
# 14 -0.06191171 -1.22071771   pink
# 15  0.55391765 -0.28477301   pink
# 16  0.55391765 -0.28477301   pink
# 17  0.87813349 -0.70920076   pink
# 18  0.68864025  1.02557137   pink
# 19 -0.30596266  0.18130348   pink
# 20  0.68864025  1.02557137   pink
# 21  0.70135590  0.12385424    red
# 22  0.11068272  1.36860228    red
# 23 -1.96661716  0.58461375    red
# 24  0.40077145 -0.04287046    red
# 25  1.78691314  1.51647060    red
# 26 -0.55584113 -0.22577099    red
# 27  0.40077145 -0.04287046    red
# 28  1.78691314  1.51647060    red
# 29 -0.47279141  0.21594157    red
# 30 -0.47279141  0.21594157    red
# 31 -1.02600445 -0.33320738 yellow
# 32 -0.72889123 -1.01857538 yellow
# 33  1.25381492  2.05008469 yellow
# 34  0.83778704  0.44820978 yellow
# 35  1.25381492  2.05008469 yellow
# 36 -0.62503927 -1.07179123 yellow
# 37 -0.62503927 -1.07179123 yellow
# 38  0.83778704  0.44820978 yellow
# 39 -0.21797491 -0.50232345 yellow
# 40 -1.68669331  0.30352864 yellow
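The same split-and-apply pattern can also be written without leaving dplyr, using group_modify() (available since dplyr 0.8.1). This is a sketch under assumptions rather than part of the answer above: df_demo and its unequal group sizes are hypothetical, and base-R row indexing with sample.int() stands in for sample_n().

```r
library(dplyr)

# Hypothetical data with unequal group sizes
set.seed(123)
df_demo <- data.frame(x = rnorm(40))
df_demo$color <- rep(c("blue", "red", "yellow", "pink"),
                     times = c(5, 10, 15, 10))

# .x is the data for one group (without the grouping column);
# resample its rows with replacement, keeping the group's size
outdat2 <- df_demo %>%
  group_by(color) %>%
  group_modify(~ .x[sample.int(nrow(.x), replace = TRUE), , drop = FALSE]) %>%
  ungroup()

count(outdat2, color)
```

Here nrow(.x) refers to the current group only, which is the same reason nrow(.) works inside map_dfr() after split().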