Question

I have a dataset like the following:

spend <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 
29, 30), Dept = c("IT", "HR", "Marketing", "HR", "IT", "IT", 
"Marketing", "IT", "Marketing", "Marketing", "IT", "IT", "HR", 
"IT", "Marketing", "Marketing", "Marketing", "HR", "HR", "IT", 
"IT", "Marketing", "IT", "Marketing", "Marketing", "IT", "HR", 
"IT", "Marketing", "IT")), .Names = c("ID", "Dept"), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -30L))

I also have a vector of ratings like this:

rating <- c("Outstanding", "Exceeds Expectation", "Achieves Expectations", "Needs Improvement")

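I want to add a new column to the dataset in which each row is randomly assigned one of the values in rating, following a given distribution. I want 5% of the values to be Outstanding, 25% Exceeds Expectation, 67% Achieves Expectations, and 3% Needs Improvement, but within each group defined by Dept. So each Dept should get these values assigned at random, but with that specific distribution.

I have not been able to get both the specific distribution and the grouping with the sample function: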

spend$Rating <- sample(rating, nrow(spend), replace = TRUE)

head(spend, 10)
# A tibble: 10 x 3
      ID      Dept                Rating
   <dbl>     <chr>                 <chr>
 1     1        IT     Needs Improvement
 2     2        HR Achieves Expectations
 3     3 Marketing   Exceeds Expectation
 4     4        HR     Needs Improvement
 5     5        IT     Needs Improvement
 6     6        IT           Outstanding
 7     7 Marketing Achieves Expectations
 8     8        IT     Needs Improvement
 9     9 Marketing     Needs Improvement
10    10 Marketing   Exceeds Expectation

This obviously does not maintain the distribution within groups. Any suggestions?

Answer 1

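thelatemail is right to point out the prob argument of sample(), and Patricio Moracho's solution uses it to give you the overall distribution you want. However, you need that distribution within each group, so here is a dplyr solution (since your data is already in a tibble). It groups by Dept and then draws each group's ratings with sample() inside mutate().

Edit:

I realized you may also want to double-check, with this approach, that the frequencies within each group shake out as expected in a larger sample. With only 30 rows, the observed proportions in each Dept can differ noticeably from the targets.

The solution on the original data: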

library(dplyr)
spend <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
                               15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
                               27, 28,  29, 30),
                        Dept = c("IT", "HR", "Marketing", "HR", "IT", "IT",
                                 "Marketing", "IT", "Marketing", "Marketing",
                                 "IT", "IT", "HR", "IT", "Marketing",
                                 "Marketing", "Marketing", "HR", "HR", "IT",
                                 "IT", "Marketing", "IT", "Marketing",
                                 "Marketing", "IT", "HR", "IT", "Marketing",
                                 "IT")), .Names = c("ID", "Dept"),
                   class = c("tbl_df",  "tbl", "data.frame"),
                   row.names = c(NA, -30L))
rating <- c("Outstanding", "Exceeds Expectation",
            "Achieves Expectations", "Needs Improvement")
probs <- c(0.05, 0.25, 0.67, 0.03)

set.seed(123)
spend %>%
    group_by(Dept) %>% # note the group_by()
    mutate(Rating=sample(rating, size=n(), prob=probs, replace=TRUE)) %>%
    arrange(Dept) %>%
    print(n=nrow(.))

# A tibble: 30 x 3
# Groups:   Dept [3]
      ID      Dept                Rating
   <dbl>     <chr>                 <chr>
 1     2        HR Achieves Expectations
 2     4        HR   Exceeds Expectation
 3    13        HR Achieves Expectations
 4    18        HR   Exceeds Expectation
 5    19        HR           Outstanding
 6    27        HR Achieves Expectations
 7     1        IT Achieves Expectations
 8     5        IT   Exceeds Expectation
 9     6        IT Achieves Expectations
10     8        IT Achieves Expectations
11    11        IT           Outstanding
12    12        IT Achieves Expectations
13    14        IT   Exceeds Expectation
14    20        IT Achieves Expectations
15    21        IT Achieves Expectations
16    23        IT   Exceeds Expectation
17    26        IT Achieves Expectations
18    28        IT Achieves Expectations
19    30        IT Achieves Expectations
20     3 Marketing           Outstanding
21     7 Marketing   Exceeds Expectation
22     9 Marketing   Exceeds Expectation
23    10 Marketing Achieves Expectations
24    15 Marketing     Needs Improvement
25    16 Marketing Achieves Expectations
26    17 Marketing   Exceeds Expectation
27    22 Marketing Achieves Expectations
28    24 Marketing Achieves Expectations
29    25 Marketing Achieves Expectations
30    29 Marketing Achieves Expectations
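
To see how the within-group frequencies behave in a larger sample, one possible check is sketched below. The data frame `big` and its 10,000 rows per department are assumptions made only for this illustration; they are not part of the original data, and the seed is arbitrary.

```r
library(dplyr)

rating <- c("Outstanding", "Exceeds Expectation",
            "Achieves Expectations", "Needs Improvement")
probs <- c(0.05, 0.25, 0.67, 0.03)

# Hypothetical larger data set (assumption): 10,000 rows per department
big <- data.frame(ID = seq_len(30000),
                  Dept = rep(c("HR", "IT", "Marketing"), each = 10000),
                  stringsAsFactors = FALSE)

set.seed(123)
check <- big %>%
    group_by(Dept) %>%                       # sample separately in each Dept
    mutate(Rating = sample(rating, size = n(), prob = probs,
                           replace = TRUE)) %>%
    ungroup() %>%
    count(Dept, Rating) %>%                  # rows per Dept/Rating pair
    group_by(Dept) %>%
    mutate(pct = 100 * n / sum(n)) %>%       # share of each rating in its Dept
    ungroup()

print(check, n = nrow(check))
```

With groups this large, the pct column should sit close to 5, 25, 67 and 3 in every Dept, whereas with 6 to 13 rows per group the proportions can only be matched roughly.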

Answer 2

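As mentioned above, you can pass the probabilities you want to sample() through its prob argument (the weights do not need to sum to 1), like this: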

spend$Rating <- sample(rating, nrow(spend), replace = TRUE, prob=c(5,25,67,3))

Test:

# Just for testing: stack many copies of the data to verify the sampled proportions
spend <- do.call("rbind", replicate(100000, spend, simplify = FALSE))

set.seed(100)
spend$Rating <- sample(rating, nrow(spend), replace = TRUE, prob=c(5,25,67,3))
aggregate(spend$Rating, by=list(spend$Rating), function(x) length(x)/nrow(spend)*100)

                Group.1         x
1 Achieves Expectations 67.000667
2   Exceeds Expectation 24.996333
3     Needs Improvement  2.995333
4           Outstanding  5.007667
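
The test above checks the overall distribution only. As a rough base R sketch of the same idea applied within each Dept, the ratings can be drawn per group with split() and put back in row order with unsplit(). The data frame `spend_big` and its 5,000 rows per department are hypothetical stand-ins chosen for this example, not the original data.

```r
rating <- c("Outstanding", "Exceeds Expectation",
            "Achieves Expectations", "Needs Improvement")
weights <- c(5, 25, 67, 3)  # relative weights passed to sample()

# Hypothetical stand-in data (assumption): departments interleaved row by row
spend_big <- data.frame(ID = seq_len(15000),
                        Dept = rep(c("HR", "IT", "Marketing"), times = 5000),
                        stringsAsFactors = FALSE)

set.seed(100)
# Draw one vector of ratings per Dept, each as long as that group
draws <- lapply(split(spend_big$Dept, spend_big$Dept), function(d) {
    sample(rating, length(d), replace = TRUE, prob = weights)
})
# Put the per-group draws back into the original row order
spend_big$Rating <- unsplit(draws, spend_big$Dept)

# Row percentages: distribution of Rating within each Dept
within_dept <- round(100 * prop.table(table(spend_big$Dept, spend_big$Rating),
                                      margin = 1), 1)
print(within_dept)
```

Each row of within_dept should add up to roughly 100, with the columns near the target percentages. Because the draws are random, the match is approximate rather than exact.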

Randomly assign values with a specific distribution within groups in R
