I have a data frame generated by the following code:
l_ids = c(1, 1, 1, 2, 2, 2, 2)
l_months = c(5, 5, 5, 88, 88, 88, 88)
l_calWeek = c(201708, 201709, 201710, 201741, 201742, 201743, 201744)
value = c(5, 6, 3, 99, 100, 1001, 1002)
dat <- setNames(data.frame(cbind(l_ids, l_months, l_calWeek, value)),
c("ids", "months", "calWeek", "value"))
It looks like this:
+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
| 1 | 5 | 201708 | 4.5 |
| 1 | 5 | 201709 | 5 |
| 1 | 5 | 201710 | 6 |
| 2 | 88 | 201741 | 75 |
| 2 | 88 | 201742 | 89 |
| 2 | 88 | 201743 | 90 |
| 2 | 88 | 201744 | 51 |
+----+-------+----------+-------+
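I want to randomly sample one calendar week from each id-month group (the month is not a calendar month). Then I want to keep, within each id-month group, all rows up to and including the sampled week.
Example output: suppose the sampling returns calendar week 201709 for the group id = 1, month = 5, and calendar week 201743 for the group id = 2, month = 88. Then the final output should be: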
+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
| 1 | 5 | 201708 | 4.5 |
| 1 | 5 | 201709 | 5 |
| 2 | 88 | 201741 | 75 |
| 2 | 88 | 201742 | 89 |
| 2 | 88 | 201743 | 90 |
+----+-------+----------+-------+
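I tried dplyr's sample_n function, which gives me a random calendar week per id-month group, but I don't know how to then get all calendar weeks before that one. Can you help me? If possible, I would like to stay with dplyr.
Please let me know if you need more information.
Many thanks
Answer 0 (score: 1)
This should do it: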
npm install --unsafe-perm
Answer 1 (score: 1)
library(dplyr)
set.seed(1) # set a seed so the sample is reproducible
# Draw one random row (one calendar week) per id
sampled <- dat %>% group_by(ids) %>% do(., sample_n(., 1))
sampled_day <- sampled$calWeek
# Within each id, find the position of the sampled week and keep the rows up to it
dat %>% group_by(ids) %>%
  mutate(max_day = which(calWeek %in% sampled_day)) %>%
  filter(row_number() <= max_day)
# You can also filter directly with row_number() <= which(calWeek %in% sampled_day)
# A tibble: 3 x 4
# Groups: ids [2]
ids months calWeek value
<dbl> <dbl> <dbl> <dbl>
1 1.00 5.00 201708 5.00
2 2.00 88.0 201741 99.0
3 2.00 88.0 201742 100
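This depends on row order! So be sure to arrange by calendar week first. You also need to think about ties. I have edited my earlier answer to simply filter with <=.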
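The row-order and ties caveats can be avoided by comparing calendar week values instead of row positions. The following is only a sketch under some assumptions: the groups are taken to be the id-month pairs from the question, and `cutoff` is a hypothetical helper column name that is not part of the original data. It uses `sample.int(n(), 1)` to pick a position, because `sample(x, 1)` behaves differently when `x` is a single number.

```r
library(dplyr)

# Data frame from the question
dat <- data.frame(
  ids     = c(1, 1, 1, 2, 2, 2, 2),
  months  = c(5, 5, 5, 88, 88, 88, 88),
  calWeek = c(201708, 201709, 201710, 201741, 201742, 201743, 201744),
  value   = c(5, 6, 3, 99, 100, 1001, 1002)
)

set.seed(1) # make the random draw reproducible

result <- dat %>%
  group_by(ids, months) %>%
  # cutoff (hypothetical name): one calendar week drawn at random per group
  mutate(cutoff = calWeek[sample.int(n(), 1)]) %>%
  # keep every week up to and including the sampled one, whatever the row order
  filter(calWeek <= cutoff) %>%
  ungroup() %>%
  select(-cutoff)

print(result)
```

Because the filter compares values, the rows do not need to be sorted first, and duplicated calendar weeks within a group are either all kept or all dropped.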