Generate a sequence of numbers that increments after every n rows, by group, in an R data.table

Asked: 2018-11-08 17:41:47

Tags: r group-by data.table sequence

I have a data.table like this:

        customer_id account_id       time count
 1:           1        AAA 2000-01-01     0
 2:           1        AAA 2000-02-01     1
 3:           1        AAA 2000-03-01     2
 4:           1        AAA 2000-04-01     3
 5:           1        AAA 2000-05-01     4
 6:           1        AAA 2000-06-01     5
 7:           1        AAA 2000-07-01     6
 8:           1        AAA 2000-08-01     7
 9:           2        BBB 2008-01-01     0
10:           2        BBB 2008-02-01     1
11:           2        BBB 2008-03-01     2
12:           2        BBB 2008-04-01     3
13:           2        BBB 2008-05-01     4
14:           2        BBB 2008-06-01     5
15:           2        BBB 2008-07-01     6
16:           2        BBB 2008-08-01     7
17:           2        BBB 2008-09-01     8
18:           2        BBB 2008-10-01     9
19:           2        BBB 2008-11-01    10
20:           2        BBB 2008-12-01    11
21:           2        BBB 2009-01-01    12
22:           2        BBB 2009-02-01    13
23:           2        BBB 2009-03-01    14
24:           2        BBB 2009-04-01    15

The code to create this data.table is here:

library(data.table)

customer_id <- c(rep(1, 8), rep(2, 16))
account_id <- c(rep("AAA", 8), rep("BBB", 16))
time <- c(seq(as.Date("2000/1/1"), by = "month", length.out = 8),
          seq(as.Date("2008/1/1"), by = "month", length.out = 16))

count <- c(seq(from = 0, to = 7), seq(from = 0, to = 15))

my_data <- data.table(customer_id, account_id, time, count)

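I would like to generate a new variable called new_var that equals 0 if the variable count is between 1 and 4, 1 if count is between 5 and 8, 2 if count is between 9 and 12, and so on. That is, by customer_id and account_id, I want to create a new variable that starts at 0 and increases by 1 after every 4 values.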

For counts equal to 0, this new variable can be, for example, NA; it does not matter. Is there a way to build this sequence (0,0,0,0,1,1,1,1,2,2,2,2,...) by group in this data.table?

2 Answers:

Answer 0 (score: 2)

Here is a dplyr solution. group_by your customer_id (and account_id), then simply use an ifelse statement inside mutate to generate the new variable.

library(dplyr)
my_data %>%
  group_by(customer_id, account_id) %>%
  mutate(new_var = ifelse(count == 0, NA, floor((count - 1) / 4)))


# A tibble: 24 x 5
# Groups:   customer_id [2], account_id [1]
#   customer_id account_id time       count new_var
#         <dbl> <chr>      <date>     <int>   <dbl>
# 1           1 AAA        2000-01-01     0      NA
# 2           1 AAA        2000-02-01     1       0
# 3           1 AAA        2000-03-01     2       0
# 4           1 AAA        2000-04-01     3       0
# 5           1 AAA        2000-05-01     4       0
# 6           1 AAA        2000-06-01     5       1
# 7           1 AAA        2000-07-01     6       1
# 8           1 AAA        2000-08-01     7       1
# 9           2 BBB        2008-01-01     0      NA
#10           2 BBB        2008-02-01     1       0
# ... with 14 more rows
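
To see why floor((count - 1) / 4) produces the requested blocks, the formula can be checked on a bare vector. This is a minimal base R sketch; the helper name block_index and its n argument are made up for illustration and are not part of the answer above.

```r
# Hypothetical helper (not from the original answer): maps counts 1..n to 0,
# n+1..2n to 1, and so on; a count of 0 becomes NA.
block_index <- function(count, n = 4L) {
  idx <- (count - 1L) %/% n   # integer division, same as floor((count - 1) / n)
  idx[count == 0L] <- NA      # the question allows NA where count == 0
  idx
}

block_index(0:12)
# NA 0 0 0 0 1 1 1 1 2 2 2 2
```

Subtracting 1 before dividing shifts the block boundaries so that counts 4, 8, 12 close a block instead of opening the next one.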

Answer 1 (score: 0)

This is a solution in pure 'data.table' syntax:

my_data[, new_var:=ifelse(count==0, NA, floor((count-1)/4)), by=.(customer_id, account_id)]
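
A possible variant, sketched under a few assumptions: the block size is held in a variable n (a name introduced here, not in the question), integer division replaces floor, and a row filter replaces ifelse. Because the formula depends only on count, it does not strictly need a by clause. The second assignment is a separate illustration of deriving blocks from the row position within each group when no count column is available; pos_var is likewise a made-up name.

```r
library(data.table)

# Rebuild the example table from the question (time column omitted for brevity).
my_data <- data.table(
  customer_id = c(rep(1, 8), rep(2, 16)),
  account_id  = c(rep("AAA", 8), rep("BBB", 16)),
  count       = c(0:7, 0:15)
)

n <- 4L  # block size; illustrative parameter, not named in the original

# Assign only where count > 0; the remaining rows are left as NA.
my_data[count > 0L, new_var := (count - 1L) %/% n]

# Without a count column, the per-group row position could be used instead.
# Blocks then start at the first row of each group, so pos_var is offset
# from new_var by the count == 0 row.
my_data[, pos_var := (rowid(customer_id, account_id) - 1L) %/% n]
```

Filtering in the row argument avoids ifelse, which evaluates both branches for every row and can drop attributes on some column types.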