
Time: 2017-11-08 21:44:10

Tags: r


spend <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 
29, 30), Dept = c("IT", "HR", "Marketing", "HR", "IT", "IT", 
"Marketing", "IT", "Marketing", "Marketing", "IT", "IT", "HR", 
"IT", "Marketing", "Marketing", "Marketing", "HR", "HR", "IT", 
"IT", "Marketing", "IT", "Marketing", "Marketing", "IT", "HR", 
"IT", "Marketing", "IT")), .Names = c("ID", "Dept"), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -30L))


rating <- c("Outstanding", "Exceeds Expectation", "Achieves Expectations", "Needs Improvement")

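I want to add a new column to the dataset in which each row is randomly assigned one of the values in rating, following a given distribution. I want 5% of the values to be Outstanding, 25% Exceeds Expectation, 67% Achieves Expectations, and 3% Needs Improvement, and this should hold within each Dept group. In other words, every Dept gets these values assigned at random, but with that specific distribution. My attempt below samples uniformly and ignores both the distribution and the groups: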


spend$Rating <- sample(rating, nrow(spend), replace = TRUE)

head(spend, 10)
# A tibble: 10 x 3
      ID      Dept                Rating
   <dbl>     <chr>                 <chr>
 1     1        IT     Needs Improvement
 2     2        HR Achieves Expectations
 3     3 Marketing   Exceeds Expectation
 4     4        HR     Needs Improvement
 5     5        IT     Needs Improvement
 6     6        IT           Outstanding
 7     7 Marketing Achieves Expectations
 8     8        IT     Needs Improvement
 9     9 Marketing     Needs Improvement
10    10 Marketing   Exceeds Expectation


2 Answers:

Answer 0: (score: 1)

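thelatemail is right to point out the prob argument of sample(), and Patricio Moracho's solution uses it to give you the desired overall distribution. However, you need that distribution within each group, so here is a dplyr solution (since your data is already a tibble):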




spend <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
                               15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
                               27, 28,  29, 30),
                        Dept = c("IT", "HR", "Marketing", "HR", "IT", "IT",
                                 "Marketing", "IT", "Marketing", "Marketing",
                                 "IT", "IT", "HR", "IT", "Marketing",
                                 "Marketing", "Marketing", "HR", "HR", "IT",
                                 "IT", "Marketing", "IT", "Marketing",
                                 "Marketing", "IT", "HR", "IT", "Marketing",
                                 "IT")), .Names = c("ID", "Dept"),
                   class = c("tbl_df",  "tbl", "data.frame"),
                   row.names = c(NA, -30L))
rating <- c("Outstanding", "Exceeds Expectation",
            "Achieves Expectations", "Needs Improvement")
probs <- c(0.05, 0.25, 0.67, 0.03)

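library(dplyr) # provides %>%, group_by(), mutate(), arrange() and n()

spend %>%
    group_by(Dept) %>% # note the group_by()
    mutate(Rating=sample(rating, size=n(), prob=probs, replace=TRUE)) %>%
    arrange(Dept) %>%
    print(n = 30) # show all 30 rows instead of the default 10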

# A tibble: 30 x 3
# Groups:   Dept [3]
      ID      Dept                Rating
   <dbl>     <chr>                 <chr>
 1     2        HR Achieves Expectations
 2     4        HR   Exceeds Expectation
 3    13        HR Achieves Expectations
 4    18        HR   Exceeds Expectation
 5    19        HR           Outstanding
 6    27        HR Achieves Expectations
 7     1        IT Achieves Expectations
 8     5        IT   Exceeds Expectation
 9     6        IT Achieves Expectations
10     8        IT Achieves Expectations
11    11        IT           Outstanding
12    12        IT Achieves Expectations
13    14        IT   Exceeds Expectation
14    20        IT Achieves Expectations
15    21        IT Achieves Expectations
16    23        IT   Exceeds Expectation
17    26        IT Achieves Expectations
18    28        IT Achieves Expectations
19    30        IT Achieves Expectations
20     3 Marketing           Outstanding
21     7 Marketing   Exceeds Expectation
22     9 Marketing   Exceeds Expectation
23    10 Marketing Achieves Expectations
24    15 Marketing     Needs Improvement
25    16 Marketing Achieves Expectations
26    17 Marketing   Exceeds Expectation
27    22 Marketing Achieves Expectations
28    24 Marketing Achieves Expectations
29    25 Marketing Achieves Expectations
30    29 Marketing Achieves Expectations
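
With groups this small, sample() only matches the target shares in expectation: HR has six rows, so the single Outstanding above is already about 17% of the group. If the shares should instead be as close to the target as the group size allows, one option (not part of the answer above) is to turn the probabilities into per-group counts and shuffle them. The sketch below is base R only; quota_assign is a hypothetical helper, demo is a simplified stand-in with the same 6/13/11 department sizes, and handing leftover slots to the largest fractional parts is an assumed rounding rule.

```r
rating <- c("Outstanding", "Exceeds Expectation",
            "Achieves Expectations", "Needs Improvement")
probs <- c(0.05, 0.25, 0.67, 0.03)

# Simplified stand-in for spend: same department sizes, departments in blocks
demo <- data.frame(ID = 1:30,
                   Dept = rep(c("HR", "IT", "Marketing"), times = c(6, 13, 11)),
                   stringsAsFactors = FALSE)

# Hypothetical helper: fixed counts per label, then a random order
quota_assign <- function(n, labels, probs) {
  raw <- n * probs
  counts <- floor(raw)
  short <- n - sum(counts)            # slots left over after rounding down
  if (short > 0) {
    # assumed rule: give leftover slots to the largest fractional parts
    idx <- order(raw - counts, decreasing = TRUE)[seq_len(short)]
    counts[idx] <- counts[idx] + 1
  }
  pool <- rep(labels, times = counts)
  pool[sample.int(n)]                 # shuffle so the assignment is random
}

# ave() applies the helper within each Dept and keeps the row order
demo$Rating <- ave(demo$Dept, demo$Dept,
                   FUN = function(d) quota_assign(length(d), rating, probs))

table(demo$Dept, demo$Rating)
```

Under this rule a group of six always receives 0 Outstanding, 2 Exceeds Expectation, 4 Achieves Expectations and 0 Needs Improvement. Rare categories can therefore disappear entirely from small groups, which may or may not be acceptable.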

Answer 1: (score: 0)


spend$Rating <- sample(rating, nrow(spend), replace = TRUE, prob=c(5,25,67,3))


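# Just for testing: stack many copies of the data to check the sampled proportions.
# Note: the percentages printed below imply about 3,000,000 rows, i.e. more copies than the 100 used here.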
spend <- do.call("rbind", replicate(100, spend, simplify = FALSE))

spend$Rating <- sample(rating, nrow(spend), replace = TRUE, prob=c(5,25,67,3))
aggregate(spend$Rating, by=list(spend$Rating), function(x) length(x)/nrow(spend)*100)

                Group.1         x
1 Achieves Expectations 67.000667
2   Exceeds Expectation 24.996333
3     Needs Improvement  2.995333
4           Outstanding  5.007667
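
The aggregate() call above reports overall shares only, while the question asks about each Dept. A minimal sketch of a per-department check follows. It assumes a stand-in vector dept (the example's 6/13/11 split repeated 1000 times) rather than the replicated spend, and the names weights, drawn and shares are illustrative.

```r
set.seed(123)  # arbitrary seed, only so the sketch is repeatable

rating <- c("Outstanding", "Exceeds Expectation",
            "Achieves Expectations", "Needs Improvement")
weights <- c(5, 25, 67, 3)  # sample() rescales prob, so it need not sum to 1

# Stand-in for the replicated data: 6000 HR, 13000 IT, 11000 Marketing rows
dept <- rep(c("HR", "IT", "Marketing"), times = c(6, 13, 11) * 1000)
drawn <- sample(rating, length(dept), replace = TRUE, prob = weights)

# Row percentages: the share of each rating within each department
shares <- 100 * prop.table(table(dept, factor(drawn, levels = rating)),
                           margin = 1)
round(shares, 1)
```

Because the draws are independent, each department should land near 5/25/67/3 once it has enough rows. The deviations seen in the 30-row example come from the small group sizes.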