Group rows only if consecutive values meet a condition

Time: 2019-01-11 01:59:34

Tags: r dplyr data.table

Using the table dat below, my goal is to group user_id and mobile_id only if difftime > -600. The sequence must be consecutive in created_at and have a rank. Each individual group is assigned an incrementing value, so that for example:

> dat
            created_at user_id mobile_id   status difftime
1  2019-01-02 22:01:38 1227604     68409 finished      \\N
2  2019-01-03 04:08:29 1227604     68409 finished     -366
3  2019-01-03 15:16:38 1227604     68409  timeout     -668
4  2019-01-04 00:34:40 1227604     68409   failed     -558
5  2019-01-04 00:27:37 1227605     68453   failed      \\N
6  2019-01-04 00:35:56 1227605     68453 finished       -8
7  2019-01-04 01:39:52 1227605     68453 finished      -63
8  2019-01-04 02:05:53 1227605     68453  timeout      -26
9  2019-01-04 02:17:17 1227605     68453  timeout      -11
10 2019-01-04 16:51:39 1227605     68453  timeout     -874

would produce this output:

> output
            created_at user_id mobile_id   status difftime group rank
1  2019-01-02 22:01:38 1227604     68409 finished      \\N    NA   NA
2  2019-01-03 04:08:29 1227604     68409 finished     -366     1    1
3  2019-01-03 15:16:38 1227604     68409  timeout     -668    NA   NA
4  2019-01-04 00:34:40 1227604     68409   failed     -558     2    1
5  2019-01-04 00:27:37 1227605     68453   failed      \\N    NA   NA
6  2019-01-04 00:35:56 1227605     68453 finished       -8     3    1
7  2019-01-04 01:39:52 1227605     68453 finished      -63     3    2
8  2019-01-04 02:05:53 1227605     68453  timeout      -26     3    3
9  2019-01-04 02:17:17 1227605     68453  timeout      -11     3    4
10 2019-01-04 16:51:39 1227605     68453  timeout     -874    NA   NA

Beyond a simple grouping in dplyr, I am not sure where to start. How can I assign a group and a rank?
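For a reproducible starting point, the printed table can be rebuilt as a data frame. This is a minimal sketch, not the asker's original data dump: it assumes that `\\N` marks a missing value (stored here as `NA`), that difftime is numeric, and that the timestamps are in UTC, none of which the page confirms.

```r
# Rebuild the table printed in the question.
# Assumptions: "\\N" means missing (NA), difftime is numeric, time zone is UTC.
dat <- data.frame(
  created_at = as.POSIXct(c(
    "2019-01-02 22:01:38", "2019-01-03 04:08:29", "2019-01-03 15:16:38",
    "2019-01-04 00:34:40", "2019-01-04 00:27:37", "2019-01-04 00:35:56",
    "2019-01-04 01:39:52", "2019-01-04 02:05:53", "2019-01-04 02:17:17",
    "2019-01-04 16:51:39"), tz = "UTC"),
  user_id   = rep(c(1227604, 1227605), times = c(4, 6)),
  mobile_id = rep(c(68409L, 68453L), times = c(4, 6)),
  status    = c("finished", "finished", "timeout", "failed", "failed",
                "finished", "finished", "timeout", "timeout", "timeout"),
  difftime  = c(NA, -366, -668, -558, NA, -8, -63, -26, -11, -874),
  stringsAsFactors = FALSE
)

str(dat)  # inspect column types before running any grouping code
```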

1 answer:

Answer 0 (score: 1)

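You can use cumsum to define a variable that increments whenever, within the same user_id and mobile_id group ordered by created_at, an observation is not consecutive (difftime below -600). Grouping by this new variable also makes it easy to create the rank index: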

library("dplyr")
library("tidyr") ## for replace_na
dat2 <- dat %>%
  group_by(user_id, mobile_id) %>%
  arrange(created_at, .by_group = TRUE) %>% ## sort by time within each user/mobile pair
  mutate(d = cumsum(replace_na(difftime < -600, 0))) %>% ## d increments at each gap below -600; NA counts as no gap
  group_by(user_id, mobile_id, d) %>%
  mutate(rank = row_number() - 1) ## zero-based rank within each run; 0 marks the first row of a run
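To see why the cumulative sum works as a run identifier, it may help to look at a single group in isolation. The sketch below uses base R only and takes the four difftime values of the first user from the question; the names `gap`, `is_break`, `d`, and `rank0` are illustrative and do not appear in the answer.

```r
# difftime values for user_id 1227604, with the missing first value as NA
gap <- c(NA, -366, -668, -558)

# TRUE where the gap to the previous row is too large; NA is treated as "no break"
is_break <- !is.na(gap) & gap < -600

# Each TRUE bumps the counter, so rows sharing a value of d form one run
d <- cumsum(is_break)
d      # 0 0 1 1

# Zero-based position inside each run; 0 marks the row that starts the run
rank0 <- ave(seq_along(d), d, FUN = seq_along) - 1
rank0  # 0 1 0 1
```

The rows with `rank0 == 0` are exactly the ones the expected output marks as `NA`.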

The easiest way to create the group index is then dplyr::group_indices:

dat2$group <- group_indices(dat2 %>% ungroup, user_id, mobile_id, d)
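In dplyr 1.0.0 and later, passing grouping columns directly to `group_indices()` is, as far as I know, deprecated in favor of grouping first and calling `cur_group_id()`. A minimal sketch of that alternative on a hypothetical toy tibble (`toy` and its values are made up for illustration and are not the question's data):

```r
library(dplyr)  # assumes dplyr >= 1.0.0 for cur_group_id()

# Hypothetical toy data: two user/mobile pairs, each split into runs by d
toy <- tibble(
  user_id   = c(1, 1, 1, 2, 2),
  mobile_id = c(10, 10, 10, 20, 20),
  d         = c(0, 0, 1, 0, 1)
)

# cur_group_id() returns one integer per group, numbered in sorted key order
toy_grouped <- toy %>%
  group_by(user_id, mobile_id, d) %>%
  mutate(group = cur_group_id()) %>%
  ungroup()

toy_grouped$group  # 1 1 2 3 4
```

The same `mutate(group = cur_group_id())` step could be appended to the `dat2` pipeline while it is still grouped by user_id, mobile_id, and d.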

I am not sure why you want the first instance of each indicator set to NA, but you can do that based on the value of rank.

> mutate(dat2, group = ifelse(rank == 0, NA, group),
+        rank = ifelse(rank == 0, NA, rank))
# A tibble: 10 x 8
# Groups:   user_id, mobile_id, d [4]
   created_at           user_id mobile_id status   difftime group rank     d
   <dttm>                 <dbl>     <int> <chr>       <dbl> <int> <dbl> <dbl>
 1 2019-01-02 22:01:38 1227604.     68409 finished      NA     NA   NA     0.
 2 2019-01-03 04:08:29 1227604.     68409 finished    -366.     1    1.    0.
 3 2019-01-03 15:16:38 1227604.     68409 timeout     -668.    NA   NA     1.
 4 2019-01-04 00:34:40 1227604.     68409 failed      -558.     2    1.    1.
 5 2019-01-04 00:27:37 1227605.     68453 failed        NA     NA   NA     0.
 6 2019-01-04 00:35:56 1227605.     68453 finished      -8.     3    1.    0.
 7 2019-01-04 01:39:52 1227605.     68453 finished     -63.     3    2.    0.
 8 2019-01-04 02:05:53 1227605.     68453 timeout      -26.     3    3.    0.
 9 2019-01-04 02:17:17 1227605.     68453 timeout      -11.     3    4.    0.
10 2019-01-04 16:51:39 1227605.     68453 timeout     -874.    NA   NA     1.