How to determine in R

Time: 2018-09-25 18:08:31

Tags: r data.table

Suppose I have a simple dataset named data:

library(data.table)

customer_id <- c("1","1","1","2","2","2","2","3","3","3")
account_id <- as.character(c(11,11,11,55,55,55,55,38,38,38))
obs_date <- c(as.Date("2017-01-01","%Y-%m-%d"), as.Date("2017-02-01","%Y-%m-%d"), as.Date("2017-03-01","%Y-%m-%d"),
          as.Date("2017-12-01","%Y-%m-%d"), as.Date("2018-01-01","%Y-%m-%d"), as.Date("2018-02-01","%Y-%m-%d"),
          as.Date("2018-03-01","%Y-%m-%d"), as.Date("2018-04-01","%Y-%m-%d"), as.Date("2018-05-01","%Y-%m-%d"),
          as.Date("2018-06-01","%Y-%m-%d"))
variable <- c(87,90,100,120,130,150,12,13,15,14)
data <- data.table(customer_id,account_id,obs_date,variable)

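I would like to add another variable called indicator that equals 1 for every customer_id, account_id pair with variable <= 90 on two or more consecutive observation dates (obs_date), and 0 otherwise. So the indicator equals 1 for the first and third customer_id, account_id pairs, like this: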

indicator <- c(1,1,1,0,0,0,0,1,1,1)
data <- data.table(customer_id,account_id,obs_date,variable, indicator)

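Do you have any ideas on how to create the indicator variable? I need to group by customer_id, account_id and identify the pairs where variable <= 90 in at least two consecutive periods. Thank you very much.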

2 answers:

Answer 0 (score: 4)

You can do...

data[, v := with(rle(variable <= 90), 
  any(lengths >= 2 & values)
), by=.(customer_id, account_id)]

    customer_id account_id   obs_date variable indicator     v
 1:           1         11 2017-01-01       87         1  TRUE
 2:           1         11 2017-02-01       90         1  TRUE
 3:           1         11 2017-03-01      100         1  TRUE
 4:           2         55 2017-12-01      120         0 FALSE
 5:           2         55 2018-01-01      130         0 FALSE
 6:           2         55 2018-02-01      150         0 FALSE
 7:           2         55 2018-03-01       12         0 FALSE
 8:           3         38 2018-04-01       13         1  TRUE
 9:           3         38 2018-05-01       15         1  TRUE
10:           3         38 2018-06-01       14         1  TRUE

To see how it works, look at this simpler line:

data[, rle(variable <= 90), by=.(customer_id, account_id)]

   customer_id account_id lengths values
1:           1         11       2   TRUE
2:           1         11       1  FALSE
3:           2         55       3  FALSE
4:           2         55       1   TRUE
5:           3         38       3   TRUE
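
The same run-length logic can be tried on a single vector outside data.table. The following is a minimal base-R sketch, not part of the original answer; the helper name has_run and its threshold and min_len parameters are hypothetical and only illustrate the technique.

```r
# Minimal base-R sketch of the rle() idea.
# has_run() is a hypothetical helper name, not part of any package.
has_run <- function(x, threshold = 90, min_len = 2) {
  runs <- rle(x <= threshold)                  # runs of consecutive TRUE / FALSE flags
  any(runs$values & runs$lengths >= min_len)   # is any TRUE run long enough?
}

has_run(c(87, 90, 100))        # first pair: two consecutive values <= 90
has_run(c(120, 130, 150, 12))  # second pair: only one value <= 90
has_run(c(13, 15, 14))         # third pair: three consecutive values <= 90
```

The result is logical, just like column v above; wrapping the call in as.integer() gives the 0/1 coding the question asks for.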

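Answer 1 (score: 4)

You can use dplyr::lag() (or data.table::shift()) to look at the previous value of variable in each row, check whether the current and the previous row are both <= 90, and then see whether that holds for at least one row in each group. Note that a bare lag() call only behaves this way when dplyr is attached; otherwise it resolves to stats::lag(), which does not shift the values of a plain vector.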

# shift() is data.table's lag; a bare lag() would need dplyr attached
data[, indicator := max(variable <= 90 & shift(variable) <= 90, na.rm = TRUE), 
     by = .(customer_id, account_id)]

data is now:

    customer_id account_id   obs_date variable indicator
 1:           1         11 2017-01-01       87         1
 2:           1         11 2017-02-01       90         1
 3:           1         11 2017-03-01      100         1
 4:           2         55 2017-12-01      120         0
 5:           2         55 2018-01-01      130         0
 6:           2         55 2018-02-01      150         0
 7:           2         55 2018-03-01       12         0
 8:           3         38 2018-04-01       13         1
 9:           3         38 2018-05-01       15         1
10:           3         38 2018-06-01       14         1

To illustrate what is happening:

data[, .(obs_date, 
         variable, 
         lag = shift(variable),   # previous value within the group, NA for the first row
         both_below = variable <= 90 & shift(variable) <= 90
       ), by = .(customer_id, account_id)]

Output:

    customer_id account_id   obs_date variable lag both_below
 1:           1         11 2017-01-01       87  NA         NA
 2:           1         11 2017-02-01       90  87       TRUE
 3:           1         11 2017-03-01      100  90      FALSE
 4:           2         55 2017-12-01      120  NA      FALSE
 5:           2         55 2018-01-01      130 120      FALSE
 6:           2         55 2018-02-01      150 130      FALSE
 7:           2         55 2018-03-01       12 150      FALSE
 8:           3         38 2018-04-01       13  NA         NA
 9:           3         38 2018-05-01       15  13       TRUE
10:           3         38 2018-06-01       14  15       TRUE
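
Both answers treat "consecutive" as "adjacent rows within a group", so the rows need to be ordered by obs_date first. The sketch below is one way to make that explicit, under the assumption that adjacent rows after sorting are the consecutive periods of interest; the small table dt is made-up illustration data, deliberately unsorted, and is not the dataset from the question.

```r
library(data.table)

# Made-up, deliberately unsorted illustration data (not the question's dataset).
dt <- data.table(
  customer_id = c("1", "1", "1", "2", "2", "2", "2"),
  obs_date    = as.Date(c("2017-03-01", "2017-01-01", "2017-02-01",
                          "2017-12-01", "2018-01-01", "2018-02-01", "2018-03-01")),
  variable    = c(100, 87, 90, 120, 130, 150, 12)
)

# Sort so that "previous row" means "previous observation date".
setorder(dt, customer_id, obs_date)

# 1 if any row and the row before it are both <= 90, otherwise 0.
dt[, indicator := as.integer(
       any(variable <= 90 & shift(variable) <= 90, na.rm = TRUE)
     ),
   by = customer_id]
```

This only compares neighbouring rows; it does not check that the dates are one calendar month apart, so gaps in obs_date would need separate handling.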