Random sampling by group and filtering based on the result

Time: 2018-03-22 08:43:37

Tags: r random dplyr

I have a data frame generated by the following code:

l_ids = c(1, 1, 1, 2, 2, 2, 2)
l_months = c(5, 5, 5, 88, 88, 88, 88)
l_calWeek = c(201708, 201709, 201710, 201741, 201742, 201743, 201744)
value = c(5, 6, 3, 99, 100, 1001, 1002)

dat <- setNames(data.frame(cbind(l_ids, l_months, l_calWeek, value)), 
c("ids", "months", "calWeek", "value"))

It looks like this:

+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
|  1 |     5 |   201708 |   4.5 |
|  1 |     5 |   201709 |     5 |
|  1 |     5 |   201710 |     6 |
|  2 |    88 |   201741 |    75 |
|  2 |    88 |   201742 |    89 |
|  2 |    88 |   201743 |    90 |
|  2 |    88 |   201744 |    51 |
+----+-------+----------+-------+

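I would like to randomly sample one calendar week from each id-month group (the months are not calendar months). Then, for each id-month combination, I would like to keep all rows up to and including the sampled week.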

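An example output could be: suppose the sampling returns calendar week 201743 for the group id = 2, month = 88, and 201709 for the group id = 1, month = 5. Then the final output should be: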

+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
|  1 |     5 |   201708 |   4.5 |
|  1 |     5 |   201709 |     5 |
|  2 |    88 |   201741 |    75 |
|  2 |    88 |   201742 |    89 |
|  2 |    88 |   201743 |    90 |
+----+-------+----------+-------+

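I tried dplyr's sample_n function, which gives me a random calendar week for each id-month group, but then I don't know how to get all the calendar weeks up to that one. Can you help me? If possible, I would like to do this with dplyr.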

Please let me know if you need any further information.

Many thanks

2 answers:

Answer 0: (score: 1)

This should do it:

npm install --unsafe-perm

Answer 1: (score: 1)

library(dplyr)
set.seed(1)     # set a seed when sampling so the result is reproducible
# sample one row (one calendar week) per id
sampled <- dat %>% group_by(ids) %>% do(., sample_n(., 1))

sampled_day <- sampled$calWeek

# within each id, find the row position of the sampled week and keep the rows up to it
dat %>% group_by(ids) %>%
  mutate(max_day = which(calWeek %in% sampled_day)) %>%
  filter(row_number() <= max_day)

# You can also just filter directly with row_number() <= which(calWeek %in% sampled_day)

# A tibble: 3 x 4
# Groups:   ids [2]
    ids months calWeek  value
  <dbl>  <dbl>   <dbl>  <dbl>
1  1.00   5.00  201708   5.00
2  2.00  88.0   201741  99.0 
3  2.00  88.0   201742 100 

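This depends on row order! So be sure to sort (arrange) by day first. However, you need to account for ties. I have edited my earlier answer to simply filter using <=.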
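The row-order caveat can be sidestepped by comparing calendar-week values directly instead of row positions. The following is a minimal sketch, not the answerer's code. It assumes the `dat` data frame from the question, rebuilt here so the block is self-contained. Grouping by both `ids` and `months` and the name `result` are assumptions made for illustration.

```r
library(dplyr)

# Data frame from the question, rebuilt so the example is self-contained
dat <- data.frame(
  ids     = c(1, 1, 1, 2, 2, 2, 2),
  months  = c(5, 5, 5, 88, 88, 88, 88),
  calWeek = c(201708, 201709, 201710, 201741, 201742, 201743, 201744),
  value   = c(5, 6, 3, 99, 100, 1001, 1002)
)

set.seed(1)  # for reproducibility; the sampled weeks can differ between R versions

result <- dat %>%
  group_by(ids, months) %>%
  # Draw one week per group by position, then keep every row whose week is
  # less than or equal to it. Comparing values makes the row order irrelevant.
  filter(calWeek <= calWeek[sample.int(n(), 1)]) %>%
  ungroup()

print(result)
```

Sampling a position with `sample.int(n(), 1)` rather than calling `sample(calWeek, 1)` avoids a known pitfall: when a group has only one row, `sample()` treats a single number `x` as the range `1:x`. If a group could contain the same week more than once, the `<=` comparison keeps all of the tied rows.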