dplyr: unique rows with sample_n

Time: 2019-07-23 17:49:04

Tags: r dplyr

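I am trying to draw a random sample, by group, from a relatively large data frame. Each group member must get unique results only: rows cannot repeat within a single member's sample or across the sample as a whole.

I have used this code successfully on a small sample: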

    # Assumes dplyr and stringr are loaded
    processors2 <- processors %>%
      filter(str_detect(Person.Who.Changed.Object, "A0")) %>%
      group_by(User) %>% sample_n(2)

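However, when I use the similar code below, I get multiple duplicates both within groups and overall (i.e. member 1 and member 3 get the same row of data, and member 1 gets one of its own rows twice).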

    claimallocator2 <- claimallocator %>%
      group_by(User) %>% sample_n(80, weight = Claim.Amt)

Adding replace = FALSE makes no difference either; I still get duplicates.

Expected output (obviously on a much smaller scale):

    User    Warranty.Claim  Claim.amt
    User 1  1   500
    User 1  2   1000
    User 1  3   1500
    User 1  4   2000
    User 1  5   2500
    User 2  6   3000
    User 2  7   3500
    User 2  8   4000
    User 2  9   4500
    User 2  10  5000
    User 2  11  5500
    User 2  12  6000
    User 3  13  6500
    User 3  14  7000
    User 3  15  7500
    User 3  16  8000
    User 3  17  8500
    User 3  18  9000
    User 3  19  9500
    User 3  20  10000
    User 3  21  10500
    User 3  22  11000

What I actually get:

    User    Warranty.Claim  Claim.amt
    User 1  1   500
    User 1  1   500
    User 1  3   1500
    User 1  4   2000
    User 1  5   2500
    User 2  6   3000
    User 2  7   3500
    User 2  8   4000
    User 2  9   4500
    User 2  10  5000
    User 2  11  5500
    User 2  12  6000
    User 3  13  6500
    User 3  14  7000
    User 3  15  7500
    User 3  16  8000
    User 3  17  8500
    User 3  18  9000
    User 3  19  9500
    User 3  8   4000
    User 3  21  10500
    User 3  22  11000

2 answers:

Answer 0: (score: 1)

Try this approach: first remove the duplicate rows, then group by user and sample the required number of cases.

    library(dplyr)

    # create toy data
    df <- data.frame(user = sample(1:10, 1000, replace = TRUE),
                     warranty = sample(1:10, 1000, replace = TRUE),
                     claim = sample(1:10, 1000, replace = TRUE))

    # count how often each user-warranty-claim trio occurs
    df %>% count(user, warranty, claim) %>% arrange(desc(n))

    # remove duplicates, then sample 2 cases per user
    df %>% group_by(user, warranty, claim) %>% slice(1) %>%
      ungroup() %>% group_by(user) %>% sample_n(2)
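
The code above removes repeats of the same user-warranty-claim trio, but the same claim can still be drawn for two different users. If the sample must also be unique overall, one possible variation is sketched below. It assumes dplyr 1.0.0 or later (for `slice_sample()`), that `Warranty.Claim` identifies a claim, and that it is acceptable to assign a shared claim to one of its users at random. The `claims` data frame and the sample size of 3 are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: claims 9-12 and 17-20 are each listed under two users
claims <- data.frame(
  User = rep(c("User 1", "User 2", "User 3"), each = 12),
  Warranty.Claim = c(1:12, 9:20, 17:28)
)
claims$Claim.Amt <- claims$Warranty.Claim * 500

sampled <- claims %>%
  slice_sample(prop = 1) %>%                       # shuffle all rows
  distinct(Warranty.Claim, .keep_all = TRUE) %>%   # keep each claim once, under whichever user came first
  group_by(User) %>%
  slice_sample(n = 3, weight_by = Claim.Amt) %>%   # weighted draw without replacement
  ungroup()

# No claim appears twice, within a user or overall
anyDuplicated(sampled$Warranty.Claim) == 0
```

Because every claim is reduced to a single row before grouping, the per-user draws cannot overlap. Note that, unlike `sample_n()`, `slice_sample()` returns fewer rows rather than raising an error when a group has fewer than `n` rows left.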

Answer 1: (score: -1)

You can check the replace option of the sample_n() function.
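
Note that `replace` already defaults to `FALSE` in `sample_n()`, and that setting only guarantees that no input row is drawn twice. If duplicates still show up, a likely explanation is that the source table itself contains repeated rows, or lists the same claim under several users. The sketch below shows one way to check both cases; the `claims` data frame is a made-up stand-in for `claimallocator`.

```r
library(dplyr)

# Hypothetical stand-in data: row 2 repeats row 1, and claim 8 is listed under two users
claims <- data.frame(
  User = c("User 1", "User 1", "User 1", "User 2", "User 3", "User 3"),
  Warranty.Claim = c(1, 1, 3, 8, 8, 13),
  Claim.Amt = c(500, 500, 1500, 4000, 4000, 6500)
)

# Rows that occur more than once within the same user
dup_within <- claims %>%
  count(User, Warranty.Claim, Claim.Amt) %>%
  filter(n > 1)

# Claims that are listed under more than one user
dup_across <- claims %>%
  distinct(User, Warranty.Claim) %>%
  count(Warranty.Claim) %>%
  filter(n > 1)
```

If either result is non-empty, the duplicates come from the input and have to be removed before sampling, whatever value `replace` takes.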