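I am trying to take a random sample, by group, from a relatively large data frame. Each group member must get unique results: no row may be repeated within a single member's sample or across the sample as a whole.
I have used this code successfully for small samples: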
processors2 <- processors %>%
  filter(str_detect(Person.Who.Changed.Object, "A0")) %>%
  group_by(User) %>%
  sample_n(., 2)
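However, when I use the similar code below, I get multiple duplicates both within groups and overall (for example, member 1 and member 3 get the same row of data, and member 1 gets the same row twice).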
claimallocator2 <- claimallocator %>%
  group_by(User) %>%
  sample_n(80, weight = Claim.Amt)
Adding replace = FALSE makes no difference either; I still get duplicates.
Expected output (obviously at a much smaller scale):
User Warranty.Claim Claim.amt
User 1 1 500
User 1 2 1000
User 1 3 1500
User 1 4 2000
User 1 5 2500
User 2 6 3000
User 2 7 3500
User 2 8 4000
User 2 9 4500
User 2 10 5000
User 2 11 5500
User 2 12 6000
User 3 13 6500
User 3 14 7000
User 3 15 7500
User 3 16 8000
User 3 17 8500
User 3 18 9000
User 3 19 9500
User 3 20 10000
User 3 21 10500
User 3 22 11000
What I actually get:
User Warranty.Claim Claim.amt
User 1 1 500
User 1 1 500
User 1 3 1500
User 1 4 2000
User 1 5 2500
User 2 6 3000
User 2 7 3500
User 2 8 4000
User 2 9 4500
User 2 10 5000
User 2 11 5500
User 2 12 6000
User 3 13 6500
User 3 14 7000
User 3 15 7500
User 3 16 8000
User 3 17 8500
User 3 18 9000
User 3 19 9500
User 3 8 4000
User 3 21 10500
User 3 22 11000
Answer 0 (score: 1)
Try this approach: first remove the duplicate rows, then group by user and sample the required number of cases.
library(dplyr)

# create toy data
df <- data.frame(user = sample(1:10, 1000, replace = TRUE),
                 warranty = sample(1:10, 1000, replace = TRUE),
                 claim = sample(1:10, 1000, replace = TRUE))
# count number of duplicate user-warranty-claim trios
df %>% count(user, warranty, claim) %>% arrange(desc(n))
# remove duplicates (keep the first row of each trio), then sample 2 cases per user
df %>% group_by(user, warranty, claim) %>% slice(1) %>%
  ungroup() %>% group_by(user) %>% sample_n(2)
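Deduplicating on the user-warranty-claim trio removes repeats within a user, but the same claim can still be drawn for two different users if it is stored under both. If that is what is happening in the real data, one option is to deduplicate on the claim identifier alone before grouping. The following is only a sketch under that assumption: the toy column names (`user`, `warranty` as the claim ID, `claim` as the amount) and the sample size of 5 are illustrative, and it uses `slice_sample()` (dplyr 1.0.0 or later) for the weighted draw.

```r
library(dplyr)

set.seed(42)  # only so the toy data is reproducible

# Toy data: `warranty` plays the role of the claim ID and is deliberately
# repeated across users (an assumption about the shape of the real data)
df <- data.frame(user     = sample(1:3, 60, replace = TRUE),
                 warranty = sample(1:30, 60, replace = TRUE))
df$claim <- df$warranty * 500  # amount derived from the ID, so repeats agree

sampled <- df %>%
  distinct(warranty, .keep_all = TRUE) %>%    # each claim ID survives once, under one user
  group_by(user) %>%
  slice_sample(n = 5, weight_by = claim) %>%  # weighted draw without replacement
  ungroup()

# TRUE: no claim ID appears twice, within or across users
anyDuplicated(sampled$warranty) == 0
```

Note that `distinct()` keeps the first occurrence, so which user retains a shared claim depends on row order. Unlike `sample_n()`, `slice_sample()` silently returns fewer rows when a group is smaller than `n` instead of raising an error.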
Answer 1 (score: -1)
You can check the replace option of the sample_n() function.
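For context, `replace = FALSE` is already the default in `sample_n()`, which would be consistent with the question's observation that setting it changed nothing. It only prevents the same row position from being drawn twice; it does not compare row contents. A minimal sketch with made-up toy data, assuming the input itself stores a row twice:

```r
library(dplyr)

# Toy data in which one row is deliberately stored twice
df <- data.frame(user     = c(1, 1, 1),
                 warranty = c(1, 1, 3),
                 claim    = c(500, 500, 1500))

# Draw all three rows of the group without replacement
out <- df %>%
  group_by(user) %>%
  sample_n(3, replace = FALSE) %>%
  ungroup()

# 1: the stored duplicate is still present in the sample
sum(duplicated(out))
```

If the real data behaves like this, the duplicates have to be removed from the input before sampling rather than through the `replace` argument.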