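Using the table dat below, my goal is to group by user_id and mobile_id only where difftime > -600. The sequence must be consecutive in created_at and have a rank. Each separate group is assigned an incrementing value. For example, given:
> dat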
created_at user_id mobile_id status difftime
1 2019-01-02 22:01:38 1227604 68409 finished \\N
2 2019-01-03 04:08:29 1227604 68409 finished -366
3 2019-01-03 15:16:38 1227604 68409 timeout -668
4 2019-01-04 00:34:40 1227604 68409 failed -558
5 2019-01-04 00:27:37 1227605 68453 failed \\N
6 2019-01-04 00:35:56 1227605 68453 finished -8
7 2019-01-04 01:39:52 1227605 68453 finished -63
8 2019-01-04 02:05:53 1227605 68453 timeout -26
9 2019-01-04 02:17:17 1227605 68453 timeout -11
10 2019-01-04 16:51:39 1227605 68453 timeout -874
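This input should produce the output below. Other than a simple group_by in dplyr, I am not sure where to start. How do I assign a group and a rank?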
> output
created_at user_id mobile_id status difftime group rank
1 2019-01-02 22:01:38 1227604 68409 finished \\N NA NA
2 2019-01-03 04:08:29 1227604 68409 finished -366 1 1
3 2019-01-03 15:16:38 1227604 68409 timeout -668 NA NA
4 2019-01-04 00:34:40 1227604 68409 failed -558 2 1
5 2019-01-04 00:27:37 1227605 68453 failed \\N NA NA
6 2019-01-04 00:35:56 1227605 68453 finished -8 3 1
7 2019-01-04 01:39:52 1227605 68453 finished -63 3 2
8 2019-01-04 02:05:53 1227605 68453 timeout -26 3 3
9 2019-01-04 02:17:17 1227605 68453 timeout -11 3 4
10 2019-01-04 16:51:39 1227605 68453 timeout -874 NA NA
Data:
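Answer 0 (score: 1)
You can use cumsum to define a variable that increases whenever observations within the same group, ordered by created_at, are not consecutive (that is, whenever difftime < -600). Grouping by this new variable then makes the rank index easy to create: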
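library("dplyr")
library("tidyr") ## for replace_na
dat2 <- dat %>%
  group_by(user_id, mobile_id) %>%
  arrange(created_at, .by_group = TRUE) %>% ## sort by time within each user/mobile pair
  ## d increases by one at every gap beyond 600; the NA on each first row counts as no gap
  mutate(d = cumsum(replace_na(difftime < -600, FALSE))) %>%
  group_by(user_id, mobile_id, d) %>%
  mutate(rank = row_number() - 1) ## zero-based position within each run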
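To see what the cumsum step does in isolation, here is a minimal base R sketch. It uses the four difftime values of user 1227604 from the question, with the first row already converted to NA. The names is_break and rank_in_run are hypothetical and exist only for this illustration.

```r
# difftime values for user 1227604, with the first row as NA
difftime <- c(NA, -366, -668, -558)

# TRUE where the gap to the previous row exceeds 600, i.e. a new run starts.
# The leading NA is treated as "no break".
is_break <- !is.na(difftime) & difftime < -600

# The running total of breaks labels each run of consecutive rows.
d <- cumsum(is_break)
print(d)            # 0 0 1 1

# Zero-based position inside each run, like row_number() - 1 per group.
rank_in_run <- ave(seq_along(d), d, FUN = seq_along) - 1
print(rank_in_run)  # 0 1 0 1
```

The values of d match the d column for the same four rows in the result printed at the end of this answer.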
Then, the easiest way to create the group index is to use dplyr::group_indices:
dat2$group <- group_indices(dat2 %>% ungroup, user_id, mobile_id, d)
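As far as I know, passing grouping columns directly to group_indices() is deprecated from dplyr 1.0.0 onward, and cur_group_id() inside a grouped mutate() is the suggested replacement. The sketch below assumes dplyr 1.0.0 or later and uses a hypothetical toy table in place of dat2, with made-up ids and a precomputed d column.

```r
library(dplyr)

# Hypothetical stand-in for dat2: two users, run counter d already computed.
toy <- tibble(
  user_id   = c(1, 1, 1, 2, 2),
  mobile_id = c(10, 10, 10, 20, 20),
  d         = c(0, 0, 1, 0, 0)
)

toy <- toy %>%
  group_by(user_id, mobile_id, d) %>%
  mutate(group = cur_group_id()) %>%  # one integer per (user_id, mobile_id, d)
  ungroup()

print(toy$group)  # 1 1 2 3 3
```

Group ids follow the sort order of the grouping columns, so this should give the same numbering as the group_indices call above.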
I am not sure why you want the first instance of each index set to NA, but you can do that based on the value of rank.
> mutate(dat2, group = ifelse(rank == 0, NA, group),
+ rank = ifelse(rank == 0, NA, rank))
# A tibble: 10 x 8
# Groups: user_id, mobile_id, d [4]
created_at user_id mobile_id status difftime group rank d
<dttm> <dbl> <int> <chr> <dbl> <int> <dbl> <dbl>
1 2019-01-02 22:01:38 1227604. 68409 finished NA NA NA 0.
2 2019-01-03 04:08:29 1227604. 68409 finished -366. 1 1. 0.
3 2019-01-03 15:16:38 1227604. 68409 timeout -668. NA NA 1.
4 2019-01-04 00:34:40 1227604. 68409 failed -558. 2 1. 1.
5 2019-01-04 00:27:37 1227605. 68453 failed NA NA NA 0.
6 2019-01-04 00:35:56 1227605. 68453 finished -8. 3 1. 0.
7 2019-01-04 01:39:52 1227605. 68453 finished -63. 3 2. 0.
8 2019-01-04 02:05:53 1227605. 68453 timeout -26. 3 3. 0.
9 2019-01-04 02:17:17 1227605. 68453 timeout -11. 3 4. 0.
10 2019-01-04 16:51:39 1227605. 68453 timeout -874. NA NA 1.