Suppose I have a simple dataset named data:
library(data.table)

customer_id <- c("1","1","1","2","2","2","2","3","3","3")
account_id <- as.character(c(11,11,11,55,55,55,55,38,38,38))
# as.Date() is vectorised and parses ISO dates (YYYY-MM-DD) by default
obs_date <- as.Date(c("2017-01-01", "2017-02-01", "2017-03-01",
                      "2017-12-01", "2018-01-01", "2018-02-01",
                      "2018-03-01", "2018-04-01", "2018-05-01",
                      "2018-06-01"))
variable <- c(87,90,100,120,130,150,12,13,15,14)
data <- data.table(customer_id, account_id, obs_date, variable)
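I want to add another variable called indicator that equals 1 for every customer_id, account_id pair in which variable <= 90 on two or more consecutive observation dates (obs_date), and 0 otherwise. So the indicator equals 1 for the first and third customer_id, account_id pairs, like this: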
indicator <- c(1,1,1,0,0,0,0,1,1,1)
data <- data.table(customer_id,account_id,obs_date,variable, indicator)
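Do you have any ideas on how to create the indicator variable? I need to group by customer_id, account_id and identify the groups where variable <= 90 in at least two consecutive periods. Thanks a lot.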
Answer 0 (score: 4)
You can do...
data[, v := with(rle(variable <= 90),
any(lengths >= 2 & values)
), by=.(customer_id, account_id)]
    customer_id account_id   obs_date variable indicator     v
 1:           1         11 2017-01-01       87         1  TRUE
 2:           1         11 2017-02-01       90         1  TRUE
 3:           1         11 2017-03-01      100         1  TRUE
 4:           2         55 2017-12-01      120         0 FALSE
 5:           2         55 2018-01-01      130         0 FALSE
 6:           2         55 2018-02-01      150         0 FALSE
 7:           2         55 2018-03-01       12         0 FALSE
 8:           3         38 2018-04-01       13         1  TRUE
 9:           3         38 2018-05-01       15         1  TRUE
10:           3         38 2018-06-01       14         1  TRUE
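To see how this works, look at this simpler one-liner, which returns the run-length encoding of variable <= 90 for each group: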
data[, rle(variable <= 90), by=.(customer_id, account_id)]
   customer_id account_id lengths values
1:           1         11       2   TRUE
2:           1         11       1  FALSE
3:           2         55       3  FALSE
4:           2         55       1   TRUE
5:           3         38       3   TRUE
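The rle() logic above can also be wrapped in a small helper so that the threshold and the minimum run length are explicit. This is only a minimal sketch, not part of the original answer: has_low_run, threshold, min_run, and dt are hypothetical names, and it assumes that rows are already in obs_date order within each group and that variable has no missing values.

```r
library(data.table)

# Hypothetical helper: TRUE if x contains a run of at least `min_run`
# consecutive values that are <= `threshold`.
has_low_run <- function(x, threshold = 90, min_run = 2) {
  runs <- rle(x <= threshold)
  any(runs$lengths >= min_run & runs$values)
}

# Same values as the question's data (obs_date omitted for brevity)
dt <- data.table(
  customer_id = c("1","1","1","2","2","2","2","3","3","3"),
  account_id  = as.character(c(11,11,11,55,55,55,55,38,38,38)),
  variable    = c(87,90,100,120,130,150,12,13,15,14)
)

# as.integer() turns the per-group TRUE/FALSE into the requested 1/0
dt[, indicator := as.integer(has_low_run(variable)),
   by = .(customer_id, account_id)]
```

Changing min_run to 3 would require three consecutive periods at or below the threshold instead of two.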
Answer 1 (score: 4)
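You can use dplyr::lag() (or data.table::shift()) to look at the previous value of variable in each row, check whether both the current row and the previous row are at or below 90, and then see whether that is true for at least one row in each group. Note that a bare lag() only works if dplyr is loaded: base R's stats::lag() does not shift a plain vector, so with data.table alone use shift().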
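# shift() returns the previous row's value within each group (NA for the first row);
# max() over the logical vector gives 1 if any consecutive pair is <= 90, else 0
data[, indicator := max(variable <= 90 & shift(variable) <= 90, na.rm = TRUE),
     by = .(customer_id, account_id)]
data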
data is now:
    customer_id account_id   obs_date variable indicator
 1:           1         11 2017-01-01       87         1
 2:           1         11 2017-02-01       90         1
 3:           1         11 2017-03-01      100         1
 4:           2         55 2017-12-01      120         0
 5:           2         55 2018-01-01      130         0
 6:           2         55 2018-02-01      150         0
 7:           2         55 2018-03-01       12         0
 8:           3         38 2018-04-01       13         1
 9:           3         38 2018-05-01       15         1
10:           3         38 2018-06-01       14         1
To illustrate what is happening:
data[, .(obs_date,
         variable,
         lag = shift(variable),  # previous value within the group
         both_below = variable <= 90 & shift(variable) <= 90
        ), by = .(customer_id, account_id)]
Output:
    customer_id account_id   obs_date variable lag both_below
 1:           1         11 2017-01-01       87  NA         NA
 2:           1         11 2017-02-01       90  87       TRUE
 3:           1         11 2017-03-01      100  90      FALSE
 4:           2         55 2017-12-01      120  NA      FALSE
 5:           2         55 2018-01-01      130 120      FALSE
 6:           2         55 2018-02-01      150 130      FALSE
 7:           2         55 2018-03-01       12 150      FALSE
 8:           3         38 2018-04-01       13  NA         NA
 9:           3         38 2018-05-01       15  13       TRUE
10:           3         38 2018-06-01       14  15       TRUE
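Two caveats follow from the table above. The comparison with the previous row only means "consecutive dates" if the rows are sorted by obs_date within each group, and max(..., na.rm = TRUE) warns and returns -Inf for a group with a single row, because its only comparison is NA. The sketch below sorts first and uses any() instead. It assumes both situations can occur in real data; dt, its unsorted rows, and the single-row customer "4" are made up for illustration, and it groups by customer_id alone for brevity.

```r
library(data.table)

# Made-up data: customer "1" is out of date order, customer "4" has one row
dt <- data.table(
  customer_id = c("1","1","1","2","2","4"),
  obs_date    = as.Date(c("2017-03-01","2017-01-01","2017-02-01",
                          "2018-01-01","2018-02-01","2018-01-01")),
  variable    = c(100, 87, 90, 120, 12, 50)
)

# Sort by reference so shift() really looks at the previous observation date
setorder(dt, customer_id, obs_date)

# any(..., na.rm = TRUE) is FALSE when every comparison is NA,
# so single-row groups get 0 rather than -Inf
dt[, indicator := as.integer(
       any(variable <= 90 & shift(variable) <= 90, na.rm = TRUE)),
   by = customer_id]
```

With these made-up values, only customer "1" has two consecutive observations at or below 90 once the rows are in date order.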