I have a dataset like the one below:
spend <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30), Dept = c("IT", "HR", "Marketing", "HR", "IT", "IT",
"Marketing", "IT", "Marketing", "Marketing", "IT", "IT", "HR",
"IT", "Marketing", "Marketing", "Marketing", "HR", "HR", "IT",
"IT", "Marketing", "IT", "Marketing", "Marketing", "IT", "HR",
"IT", "Marketing", "IT")), .Names = c("ID", "Dept"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -30L))
I also have a character vector of ratings like this:
rating <- c("Outstanding", "Exceeds Expectation", "Achieves Expectations", "Needs Improvement")
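I want to add a new column to the dataset in which each row is randomly assigned one of the values in rating, following a given distribution. I want 5% of the values to be Outstanding, 25% Exceeds Expectation, 67% Achieves Expectations, and 3% Needs Improvement, but within each group defined by Dept. In other words, every Dept should get these values assigned at random, but with that specific distribution.
I can't get both the specific distribution and the grouping with the sample function: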
spend$Rating <- sample(rating, nrow(spend), replace = TRUE)
head(spend, 10)
# A tibble: 10 x 3
ID Dept Rating
<dbl> <chr> <chr>
1 1 IT Needs Improvement
2 2 HR Achieves Expectations
3 3 Marketing Exceeds Expectation
4 4 HR Needs Improvement
5 5 IT Needs Improvement
6 6 IT Rockstar
7 7 Marketing Achieves Expectations
8 8 IT Needs Improvement
9 9 Marketing Needs Improvement
10 10 Marketing Exceeds Expectation
This clearly does not maintain the distribution within groups. Any suggestions?
Answer 0 (score: 1)
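thelatemail is right to point to the prob argument of sample(), and Patricio Moracho's solution uses it to give you the overall distribution you want. You need that distribution within each group, though, so here is a dplyr solution (since your data is already a tibble). The key step is group_by(Dept) before the mutate(), so that sample() is called once per department with size = n().
Edit:
Because the groups here are small (6 to 13 rows per Dept), the observed shares in any single draw can sit far from the targets, so with this approach you may also want to double-check that the within-group frequencies shake out in a larger sample.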
library(dplyr)
spend <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30),
Dept = c("IT", "HR", "Marketing", "HR", "IT", "IT",
"Marketing", "IT", "Marketing", "Marketing",
"IT", "IT", "HR", "IT", "Marketing",
"Marketing", "Marketing", "HR", "HR", "IT",
"IT", "Marketing", "IT", "Marketing",
"Marketing", "IT", "HR", "IT", "Marketing",
"IT")), .Names = c("ID", "Dept"),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -30L))
rating <- c("Outstanding", "Exceeds Expectation",
"Achieves Expectations", "Needs Improvement")
probs <- c(0.05, 0.25, 0.67, 0.03)
set.seed(123)
spend %>%
  group_by(Dept) %>%  # note the group_by()
  mutate(Rating = sample(rating, size = n(), prob = probs, replace = TRUE)) %>%
  arrange(Dept) %>%
  print(n = nrow(.))
# A tibble: 30 x 3
# Groups: Dept [3]
ID Dept Rating
<dbl> <chr> <chr>
1 2 HR Achieves Expectations
2 4 HR Exceeds Expectation
3 13 HR Achieves Expectations
4 18 HR Exceeds Expectation
5 19 HR Outstanding
6 27 HR Achieves Expectations
7 1 IT Achieves Expectations
8 5 IT Exceeds Expectation
9 6 IT Achieves Expectations
10 8 IT Achieves Expectations
11 11 IT Outstanding
12 12 IT Achieves Expectations
13 14 IT Exceeds Expectation
14 20 IT Achieves Expectations
15 21 IT Achieves Expectations
16 23 IT Exceeds Expectation
17 26 IT Achieves Expectations
18 28 IT Achieves Expectations
19 30 IT Achieves Expectations
20 3 Marketing Outstanding
21 7 Marketing Exceeds Expectation
22 9 Marketing Exceeds Expectation
23 10 Marketing Achieves Expectations
24 15 Marketing Needs Improvement
25 16 Marketing Achieves Expectations
26 17 Marketing Exceeds Expectation
27 22 Marketing Achieves Expectations
28 24 Marketing Achieves Expectations
29 25 Marketing Achieves Expectations
30 29 Marketing Achieves Expectations
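The larger-sample check mentioned in the edit is not shown in the answer. A minimal sketch of what it might look like follows, assuming a hypothetical data set named big with 10,000 rows per department (the name, the size, and the freq summary are illustrative, not part of the original answer). It reuses the same group_by() plus sample() pattern and then reports the share of each rating within each Dept.

```r
library(dplyr)

rating <- c("Outstanding", "Exceeds Expectation",
            "Achieves Expectations", "Needs Improvement")
probs <- c(0.05, 0.25, 0.67, 0.03)

# Hypothetical larger data set: 3 departments x 10,000 rows each
set.seed(123)
big <- tibble(ID = 1:30000,
              Dept = rep(c("IT", "HR", "Marketing"), each = 10000))

freq <- big %>%
  group_by(Dept) %>%
  # one sample() call per department, sized to that department
  mutate(Rating = sample(rating, size = n(), prob = probs, replace = TRUE)) %>%
  ungroup() %>%
  count(Dept, Rating) %>%               # rows per Dept/Rating pair
  group_by(Dept) %>%
  mutate(pct = 100 * n / sum(n)) %>%    # share within each Dept
  ungroup()

print(freq)
```

With groups this large, the pct column should land close to 5, 25, 67 and 3 in every department. With the original 30 rows it generally will not, because sample() draws each row independently rather than filling fixed quotas.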
Answer 1 (score: 0)
As mentioned above, you can pass the probabilities you want to sample() through its prob argument, like this:
spend$Rating <- sample(rating, nrow(spend), replace = TRUE, prob=c(5,25,67,3))
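The call above passes percentages (5, 25, 67, 3) rather than probabilities. That works because sample() treats prob as relative weights that need not sum to one. The sketch below is an illustration only, with the vector names a and b and the sample size of 10,000 chosen for the example; it compares the observed shares from both ways of writing the weights.

```r
rating <- c("Outstanding", "Exceeds Expectation",
            "Achieves Expectations", "Needs Improvement")

# Weights given as percentages
set.seed(1)
a <- sample(rating, 10000, replace = TRUE, prob = c(5, 25, 67, 3))

# The same weights given as probabilities summing to one
set.seed(1)
b <- sample(rating, 10000, replace = TRUE, prob = c(0.05, 0.25, 0.67, 0.03))

# Observed shares from each call
print(prop.table(table(a)))
print(prop.table(table(b)))
```

Either form should give shares near the targets, so the choice between them is a matter of readability.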
Test:
# For testing only: stack many copies of the data so the observed shares are stable
spend <- do.call("rbind", replicate(10000, spend, simplify = FALSE))  # 30 rows x 10,000 = 300,000 rows
set.seed(100)
spend$Rating <- sample(rating, nrow(spend), replace = TRUE, prob=c(5,25,67,3))
aggregate(spend$Rating, by=list(spend$Rating), function(x) length(x)/nrow(spend)*100)
Group.1 x
1 Achieves Expectations 67.000667
2 Exceeds Expectation 24.996333
3 Needs Improvement 2.995333
4 Outstanding 5.007667
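The test above checks the overall shares, while the question asks for the distribution inside each Dept. One possible base R variant, sketched here under assumptions, draws separately per department with ave() and then tabulates row-wise percentages. The stand-in data frame (30,000 rows keeping the 13:6:11 department proportions) and the names probs and shares are hypothetical and not taken from the answer.

```r
rating <- c("Outstanding", "Exceeds Expectation",
            "Achieves Expectations", "Needs Improvement")
probs <- c(5, 25, 67, 3)  # relative weights, as in the answer

# Hypothetical stand-in data: same department proportions, many more rows
set.seed(100)
spend <- data.frame(ID = 1:30000,
                    Dept = rep(c("IT", "HR", "Marketing"),
                               times = c(13000, 6000, 11000)),
                    stringsAsFactors = FALSE)

# ave() splits the rows by Dept, applies FUN to each piece, and puts the
# results back in the original row order, so sample() runs once per Dept
spend$Rating <- ave(spend$Dept, spend$Dept, FUN = function(x) {
  sample(rating, length(x), replace = TRUE, prob = probs)
})

# Percentage of each rating within each Dept (each row sums to 100)
shares <- 100 * prop.table(table(spend$Dept, spend$Rating), margin = 1)
print(round(shares, 2))
```

Because every row is still drawn independently, the per-department shares only approach the targets as the groups grow; they are not guaranteed exactly for any finite group.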