Question

Suppose I have a simple dataset called data:

library(data.table)

customer_id <- c("1","1","1","2","2","2","2","3","3","3")
account_id <- as.character(c(11,11,11,55,55,55,55,38,38,38))
# ISO-formatted strings are parsed by as.Date() without an explicit format
obs_date <- as.Date(c("2017-01-01", "2017-02-01", "2017-03-01",
                      "2017-12-01", "2018-01-01", "2018-02-01",
                      "2018-03-01", "2018-04-01", "2018-05-01",
                      "2018-06-01"))
variable <- c(87,90,100,120,130,150,12,13,15,14)
data <- data.table(customer_id,account_id,obs_date,variable)

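I would like to add another variable called indicator that equals 1 for every customer_id, account_id pair in which variable <= 90 on two or more consecutive observation dates (obs_date), and 0 otherwise. So the indicator equals 1 for the first and third customer_id, account_id pairs, like this: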

indicator <- c(1,1,1,0,0,0,0,1,1,1)
data <- data.table(customer_id,account_id,obs_date,variable, indicator)

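Do you have any ideas on how to create the indicator variable? I need to group by customer_id, account_id and identify the pairs where variable <= 90 in at least two consecutive periods. Thank you very much.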

Answer 1

You can do...

data[, v := with(rle(variable <= 90), 
  any(lengths >= 2 & values)
), by=.(customer_id, account_id)]

    customer_id account_id   obs_date variable indicator     v
 1:           1         11 2017-01-01       87         1  TRUE
 2:           1         11 2017-02-01       90         1  TRUE
 3:           1         11 2017-03-01      100         1  TRUE
 4:           2         55 2017-12-01      120         0 FALSE
 5:           2         55 2018-01-01      130         0 FALSE
 6:           2         55 2018-02-01      150         0 FALSE
 7:           2         55 2018-03-01       12         0 FALSE
 8:           3         38 2018-04-01       13         1  TRUE
 9:           3         38 2018-05-01       15         1  TRUE
10:           3         38 2018-06-01       14         1  TRUE

To see how it works, look at this simpler one-liner:

data[, rle(variable <= 90), by=.(customer_id, account_id)]

   customer_id account_id lengths values
1:           1         11       2   TRUE
2:           1         11       1  FALSE
3:           2         55       3  FALSE
4:           2         55       1   TRUE
5:           3         38       3   TRUE
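As a standalone illustration of the same idea, here is a minimal sketch on a plain vector, using the three values of the first customer from the question. The names x, runs and has_run are made up for this example.

```r
# Values of `variable` for customer_id "1" in the question's data
x <- c(87, 90, 100)

# rle() collapses consecutive equal values into runs:
# two TRUEs (87 and 90 are <= 90) followed by one FALSE (100)
runs <- rle(x <= 90)
runs$lengths
runs$values

# The group qualifies if any run of TRUE values has length >= 2
has_run <- any(runs$lengths >= 2 & runs$values)
has_run
```

Here runs$lengths is 2 1 and runs$values is TRUE FALSE, matching the first two rows of the table above, so has_run is TRUE.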

Answer 2

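You can use dplyr::lag() (or data.table::shift()) to look at the previous value of variable in each row, check whether both the current row and the previous row are <= 90, and then see whether that is true for any row in each group.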

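# shift() is data.table's lag; a bare lag() resolves to stats::lag() unless
# dplyr is attached, and stats::lag() does not shift the values of a plain vector
data[, indicator := max(variable <= 90 & shift(variable) <= 90, na.rm=TRUE), 
     by=.(customer_id, account_id)]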

data is now:

    customer_id account_id   obs_date variable indicator
 1:           1         11 2017-01-01       87         1
 2:           1         11 2017-02-01       90         1
 3:           1         11 2017-03-01      100         1
 4:           2         55 2017-12-01      120         0
 5:           2         55 2018-01-01      130         0
 6:           2         55 2018-02-01      150         0
 7:           2         55 2018-03-01       12         0
 8:           3         38 2018-04-01       13         1
 9:           3         38 2018-05-01       15         1
10:           3         38 2018-06-01       14         1

To illustrate what is happening:

data[, .(obs_date, 
         variable, 
         lag = shift(variable),   # previous value within the group (NA in the first row)
         both_below = variable <= 90 & shift(variable) <= 90
       ), by=.(customer_id, account_id)]

Output:

    customer_id account_id   obs_date variable lag both_below
 1:           1         11 2017-01-01       87  NA         NA
 2:           1         11 2017-02-01       90  87       TRUE
 3:           1         11 2017-03-01      100  90      FALSE
 4:           2         55 2017-12-01      120  NA      FALSE
 5:           2         55 2018-01-01      130 120      FALSE
 6:           2         55 2018-02-01      150 130      FALSE
 7:           2         55 2018-03-01       12 150      FALSE
 8:           3         38 2018-04-01       13  NA         NA
 9:           3         38 2018-05-01       15  13       TRUE
10:           3         38 2018-06-01       14  15       TRUE
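Two caveats seem worth a sketch. Both answers compare adjacent rows, so they appear to assume the rows are already sorted by obs_date within each pair. In addition, a pair with a single observation that is <= 90 would leave max(..., na.rm=TRUE) with nothing to take the maximum of, which in base R returns -Inf with a warning. The helper has_low_run below is hypothetical (it is not from either answer) and combines the rle() idea with an integer result; the small table dt and its column id are also made up for illustration.

```r
# Hypothetical helper: 1L if x contains at least `min_run` consecutive
# values <= `threshold`, otherwise 0L
has_low_run <- function(x, threshold = 90, min_run = 2) {
  runs <- rle(x <= threshold)
  as.integer(any(runs$lengths >= min_run & runs$values))
}

has_low_run(c(87, 90, 100))        # customer 1 in the question
has_low_run(c(120, 130, 150, 12))  # customer 2 in the question
has_low_run(50)                    # a single observation cannot form a run of 2

# Usage with data.table, if it is installed
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- data.table(id = c("a", "a", "b"),
                   obs_date = as.Date(c("2017-02-01", "2017-01-01", "2017-01-01")),
                   variable = c(90, 87, 50))
  setorder(dt, id, obs_date)  # sort by reference so adjacent rows are adjacent dates
  dt[, indicator := has_low_run(variable), by = id]
  print(dt)
}
```

Note that "consecutive" here means adjacent rows after sorting; the sketch does not check for gaps between the dates themselves.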
