I need to count how often particular events occur within particular time windows. Suppose I have the following data:
set.seed(1453)
id = c(1,1,1,1,1,2,2,2,2,2)
x_1 = sample(0:1, 10, TRUE)
x_5 = sample(0:1, 10, TRUE)
date = c('2016-01-01',
'2016-02-01',
'2016-02-23',
'2016-03-04',
'2016-04-01',
'2016-01-01',
'2016-02-01',
'2016-02-23',
'2016-03-04',
'2016-04-01'
)
df = data.frame(id,date=as.Date(date),snapshot_date = as.Date(date)+1,x_1,x_5)
Table 1 (input)
id date snapshot_date x_1 x_5
1 2016-01-01 2016-01-02 1 0
1 2016-02-01 2016-02-02 0 1
1 2016-02-23 2016-02-24 1 1
1 2016-03-04 2016-03-05 0 0
1 2016-04-01 2016-04-02 0 1
2 2016-01-01 2016-01-02 1 1
2 2016-02-01 2016-02-02 1 0
2 2016-02-23 2016-02-24 0 0
2 2016-03-04 2016-03-05 0 0
2 2016-04-01 2016-04-02 1 1
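I need to count how many times x_1 = 1 and x_5 = 1 occurred in each of the past 3 months. So I first create dummy variables: x_1_n = TRUE if x_1 = 1 and FALSE otherwise, and likewise for x_5_n. I also create three lookback dates, one per month.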
# Logical flags: TRUE where the event occurred
df$x_1_n <- df$x_1 == 1
df$x_5_n <- df$x_5 == 1
library(lubridate)
# DATE_MO1..DATE_MO3: the snapshot date rolled back by 1, 2 and 3 months
for (i in 1:3) {
  df[, paste0("DATE_MO", i)] <- df$snapshot_date %m-% months(i)
}
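For readers unfamiliar with `%m-%`, here is a minimal sketch of what the rollback does. The vector `snap` is a hypothetical example that is not part of the question's data; it includes a month-end date to show that `%m-%` falls back to the last valid day of the month rather than returning NA.

```r
library(lubridate)

# Hypothetical snapshot dates (illustration only)
snap <- as.Date(c("2016-01-02", "2016-03-31"))

# Subtract one calendar month; 2016-03-31 rolls back to 2016-02-29
# because February 2016 has no 31st
back1 <- snap %m-% months(1)
back1
```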
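I have the variables x_1 and x_5. I need a loop that runs over all of them and counts the occurrences between certain dates. The original code below runs and is correct, but I would like to see how to simplify it with a for loop, so that I do not have to copy and paste the code for each of x_1 and x_5 by hand; the real data has many more x_ variables and dates.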
library(data.table)
df <- data.table(df)
df[, c("x1_dminus_mo1", "x1_dminus_mo2", "x1_dminus_mo3") := .(
  df[x_1_n][df[, .(id, DATE_MO1, snapshot_date)],
    on = .(id, date >= DATE_MO1, date < snapshot_date), .N, by = .EACHI]$N,
  df[x_1_n][df[, .(id, DATE_MO2, DATE_MO1)],
    on = .(id, date >= DATE_MO2, date < DATE_MO1), .N, by = .EACHI]$N,
  df[x_1_n][df[, .(id, DATE_MO3, DATE_MO2)],
    on = .(id, date >= DATE_MO3, date < DATE_MO2), .N, by = .EACHI]$N
)]
df[, c("x5_dminus_mo1", "x5_dminus_mo2", "x5_dminus_mo3") := .(
  df[x_5_n][df[, .(id, DATE_MO1, snapshot_date)],
    on = .(id, date >= DATE_MO1, date < snapshot_date), .N, by = .EACHI]$N,
  df[x_5_n][df[, .(id, DATE_MO2, DATE_MO1)],
    on = .(id, date >= DATE_MO2, date < DATE_MO1), .N, by = .EACHI]$N,
  df[x_5_n][df[, .(id, DATE_MO3, DATE_MO2)],
    on = .(id, date >= DATE_MO3, date < DATE_MO2), .N, by = .EACHI]$N
)]
I would like to obtain the following table, but using a loop.
Table 2 (output)
df[,c(1,2,4,11,12,13)]
id date x_1 x1_dminus_mo1 x1_dminus_mo2 x1_dminus_mo3
1 2016-01-01 1 1 0 0
1 2016-02-01 0 0 1 0
1 2016-02-23 1 1 1 0
1 2016-03-04 0 1 0 1
1 2016-04-01 0 0 1 0
2 2016-01-01 1 1 0 0
2 2016-02-01 1 1 1 0
2 2016-02-23 0 1 1 0
2 2016-03-04 0 0 1 1
2 2016-04-01 1 1 0 1
Answer 0 (score: 3)
Thanks to @Frank, I found the right path. Here is the solution:
for (i in c(1,5)){
  col = paste0("x_",i)
  df[,paste0("new_dminus_mo1_x", i)] <- df[, .SD[.(1), on=col]][df[,.(id,DATE_MO1,snapshot_date)],
    on=.(id, date >= DATE_MO1, date < snapshot_date), .N, by = .EACHI]$N
  df[,paste0("new_dminus_mo2_x",i)] <- df[, .SD[.(1), on=col]][df[,.(id,DATE_MO2,DATE_MO1)],
    on=.(id, date >= DATE_MO2, date < DATE_MO1), .N, by = .EACHI]$N
  df[,paste0("new_dminus_mo3_x",i)] <- df[, .SD[.(1), on=col]][df[,.(id,DATE_MO3,DATE_MO2)],
    on=.(id, date >= DATE_MO3, date < DATE_MO2), .N, by = .EACHI]$N
}
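The three nearly identical assignments inside that loop could probably be folded into an inner loop over the month index as well. The following is only a sketch of that idea, not part of the original answer. It types Table 1 in by hand so that the result does not depend on the random number generator, and the names `bounds`, `events`, `win` and the `cnt_` column prefix are made up for illustration.

```r
library(data.table)
library(lubridate)

# Table 1 entered by hand (values copied from the question's input table)
DT <- data.table(
  id   = rep(1:2, each = 5),
  date = as.Date(rep(c("2016-01-01", "2016-02-01", "2016-02-23",
                       "2016-03-04", "2016-04-01"), 2)),
  x_1  = c(1, 0, 1, 0, 0, 1, 1, 0, 0, 1),
  x_5  = c(0, 1, 1, 0, 1, 1, 0, 0, 0, 1)
)
DT[, snapshot_date := date + 1]

# bounds[[1]] is the snapshot date; bounds[[m + 1]] is m months earlier
bounds <- lapply(0:3, function(m) DT$snapshot_date %m-% months(m))

for (v in c("x_1", "x_5")) {
  # Rows where the event in column v occurred
  events <- DT[DT[[v]] == 1, .(id, date)]
  for (m in 1:3) {
    # One window per row of DT: [m months back, m - 1 months back)
    win <- data.table(id = DT$id, lo = bounds[[m + 1]], hi = bounds[[m]])
    # Non-equi join; .N per window row is 0 when no event falls inside
    n <- events[win, on = .(id, date >= lo, date < hi), .N, by = .EACHI]$N
    set(DT, j = paste0("cnt_", v, "_mo", m), value = n)
  }
}
DT[, .(id, date, x_1, cnt_x_1_mo1, cnt_x_1_mo2, cnt_x_1_mo3)]
```

With this layout, adding another variable or another month only means extending `c("x_1", "x_5")` or the `0:3` / `1:3` ranges.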
Answer 1 (score: 3)
I would build a long-format table so that only one join is needed per column:
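library(data.table)
library(lubridate)
# Assumption: DT is the question's original five-column data as a data.table,
# e.g. DT = data.table(df) taken before the helper columns were added
lookup = melt(DT[, lapply(0:3, function(x) snapshot_date %m-% months(x)), by=id],
  id = "id",
  meas = list(2:4, 3:5),
  value.name = c("d_up", "d_dn"))
lookup[, rn := rowid(variable), by=id]
cols = c("x_1", "x_5")
for (k in cols) lookup[, paste0("n_", k) :=
  DT[.(1), on=k][.SD, on=.(id, date >= d_dn, date < d_up), .N, by=.EACHI]$N][]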
id variable d_up d_dn rn n_x_1 n_x_5
1: 1 1 2016-01-02 2015-12-02 1 1 0
2: 1 1 2016-02-02 2016-01-02 2 0 1
3: 1 1 2016-02-24 2016-01-24 3 1 2
4: 1 1 2016-03-05 2016-02-05 4 1 1
5: 1 1 2016-04-02 2016-03-02 5 0 1
6: 2 1 2016-01-02 2015-12-02 1 1 1
7: 2 1 2016-02-02 2016-01-02 2 1 0
8: 2 1 2016-02-24 2016-01-24 3 1 0
9: 2 1 2016-03-05 2016-02-05 4 0 0
10: 2 1 2016-04-02 2016-03-02 5 1 1
11: 1 2 2015-12-02 2015-11-02 1 0 0
12: 1 2 2016-01-02 2015-12-02 2 1 0
13: 1 2 2016-01-24 2015-12-24 3 1 0
14: 1 2 2016-02-05 2016-01-05 4 0 1
15: 1 2 2016-03-02 2016-02-02 5 1 1
16: 2 2 2015-12-02 2015-11-02 1 0 0
17: 2 2 2016-01-02 2015-12-02 2 1 1
18: 2 2 2016-01-24 2015-12-24 3 1 1
19: 2 2 2016-02-05 2016-01-05 4 1 0
20: 2 2 2016-03-02 2016-02-02 5 0 0
21: 1 3 2015-11-02 2015-10-02 1 0 0
22: 1 3 2015-12-02 2015-11-02 2 0 0
23: 1 3 2015-12-24 2015-11-24 3 0 0
24: 1 3 2016-01-05 2015-12-05 4 1 0
25: 1 3 2016-02-02 2016-01-02 5 0 1
26: 2 3 2015-11-02 2015-10-02 1 0 0
27: 2 3 2015-12-02 2015-11-02 2 0 0
28: 2 3 2015-12-24 2015-11-24 3 0 0
29: 2 3 2016-01-05 2015-12-05 4 1 1
30: 2 3 2016-02-02 2016-01-02 5 1 0
id variable d_up d_dn rn n_x_1 n_x_5
The argument meas = list(2:4, 3:5) just "melts" columns 2:4 into a single column, and likewise columns 3:5 into a second one.
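To make the effect of a list-valued measure argument concrete, here is a small sketch on a hypothetical one-row table `w` whose columns V1..V4 stand in for the four window boundaries. The numbers are placeholders, not dates from the question.

```r
library(data.table)

# Hypothetical boundaries, newest first (illustration only)
w <- data.table(id = 1L, V1 = 40L, V2 = 30L, V3 = 20L, V4 = 10L)

# Columns 2:4 (V1..V3) stack into d_up and columns 3:5 (V2..V4) into d_dn,
# so each output row pairs a boundary with the one just before it
long <- melt(w, id.vars = "id",
             measure.vars = list(2:4, 3:5),
             value.name = c("d_up", "d_dn"))
long
```

Each of the three resulting rows is one window, with `variable` numbering the windows 1 to 3.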
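The long format has the added benefit that you do not need to spend much effort on naming conventions for columns that hold similar data ("DATE_MO[x]", "new_dminus_mo[x]_x", etc.).
I prefer the format as it is here (with a separate long-format table), but if you like, an "update join" can be used to bring the columns from DT back in (with repeated values):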
DT[, rn := rowid(id)]
DTcols = setdiff(names(DT), names(lookup))
lookup[DT, on=.(id, rn), (DTcols) := mget(paste0("i.", DTcols))]
id variable d_up d_dn rn n_x_1 n_x_5 date snapshot_date x_1 x_5
1: 1 1 2016-01-02 2015-12-02 1 1 0 2016-01-01 2016-01-02 1 0
2: 1 1 2016-02-02 2016-01-02 2 0 1 2016-02-01 2016-02-02 0 1
3: 1 1 2016-02-24 2016-01-24 3 1 2 2016-02-23 2016-02-24 1 1
4: 1 1 2016-03-05 2016-02-05 4 1 1 2016-03-04 2016-03-05 0 0
5: 1 1 2016-04-02 2016-03-02 5 0 1 2016-04-01 2016-04-02 0 1
6: 2 1 2016-01-02 2015-12-02 1 1 1 2016-01-01 2016-01-02 1 1
7: 2 1 2016-02-02 2016-01-02 2 1 0 2016-02-01 2016-02-02 1 0
8: 2 1 2016-02-24 2016-01-24 3 1 0 2016-02-23 2016-02-24 0 0
9: 2 1 2016-03-05 2016-02-05 4 0 0 2016-03-04 2016-03-05 0 0
10: 2 1 2016-04-02 2016-03-02 5 1 1 2016-04-01 2016-04-02 1 1
11: 1 2 2015-12-02 2015-11-02 1 0 0 2016-01-01 2016-01-02 1 0
12: 1 2 2016-01-02 2015-12-02 2 1 0 2016-02-01 2016-02-02 0 1
13: 1 2 2016-01-24 2015-12-24 3 1 0 2016-02-23 2016-02-24 1 1
14: 1 2 2016-02-05 2016-01-05 4 0 1 2016-03-04 2016-03-05 0 0
15: 1 2 2016-03-02 2016-02-02 5 1 1 2016-04-01 2016-04-02 0 1
16: 2 2 2015-12-02 2015-11-02 1 0 0 2016-01-01 2016-01-02 1 1
17: 2 2 2016-01-02 2015-12-02 2 1 1 2016-02-01 2016-02-02 1 0
18: 2 2 2016-01-24 2015-12-24 3 1 1 2016-02-23 2016-02-24 0 0
19: 2 2 2016-02-05 2016-01-05 4 1 0 2016-03-04 2016-03-05 0 0
20: 2 2 2016-03-02 2016-02-02 5 0 0 2016-04-01 2016-04-02 1 1
21: 1 3 2015-11-02 2015-10-02 1 0 0 2016-01-01 2016-01-02 1 0
22: 1 3 2015-12-02 2015-11-02 2 0 0 2016-02-01 2016-02-02 0 1
23: 1 3 2015-12-24 2015-11-24 3 0 0 2016-02-23 2016-02-24 1 1
24: 1 3 2016-01-05 2015-12-05 4 1 0 2016-03-04 2016-03-05 0 0
25: 1 3 2016-02-02 2016-01-02 5 0 1 2016-04-01 2016-04-02 0 1
26: 2 3 2015-11-02 2015-10-02 1 0 0 2016-01-01 2016-01-02 1 1
27: 2 3 2015-12-02 2015-11-02 2 0 0 2016-02-01 2016-02-02 1 0
28: 2 3 2015-12-24 2015-11-24 3 0 0 2016-02-23 2016-02-24 0 0
29: 2 3 2016-01-05 2015-12-05 4 1 1 2016-03-04 2016-03-05 0 0
30: 2 3 2016-02-02 2016-01-02 5 1 0 2016-04-01 2016-04-02 1 1
id variable d_up d_dn rn n_x_1 n_x_5 date snapshot_date x_1 x_5
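The update join used above can be hard to read at first, so here is a stripped-down sketch of the same idiom. The tables `long` and `src` are hypothetical stand-ins for `lookup` and `DT`.

```r
library(data.table)

# Hypothetical long table: each (id, rn) pair appears once per window
long <- data.table(id = 1L, rn = c(1L, 2L, 1L, 2L), variable = c(1L, 1L, 2L, 2L))
# Hypothetical source table with one row per (id, rn)
src <- data.table(id = 1L, rn = 1:2, x_1 = c(1, 0))

# Columns that exist only in src
new_cols <- setdiff(names(src), names(long))

# Update join: every row of long that matches a row of src on (id, rn)
# receives that row's values by reference; the i. prefix refers to
# columns of the table in the i position (src)
long[src, on = .(id, rn), (new_cols) := mget(paste0("i.", new_cols))]
long$x_1
```

Because each (id, rn) pair occurs twice in `long`, the values from `src` are repeated, which is the "repeated values" effect mentioned above.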
Alternatively, reshape it to wide format and update-join it back onto DT:
wDT = dcast(lookup, id + rn ~ variable, value.var = paste0("n_", cols))
id rn n_x_1_1 n_x_1_2 n_x_1_3 n_x_5_1 n_x_5_2 n_x_5_3
1: 1 1 1 0 0 0 0 0
2: 1 2 0 1 0 1 0 0
3: 1 3 1 1 0 2 0 0
4: 1 4 1 0 1 1 1 0
5: 1 5 0 1 0 1 1 1
6: 2 1 1 0 0 1 0 0
7: 2 2 1 1 0 0 1 0
8: 2 3 1 1 0 0 1 0
9: 2 4 0 1 1 0 0 1
10: 2 5 1 0 1 1 0 0
DT[, rn := rowid(id)]
wDTcols = setdiff(names(wDT), names(DT))
DT[wDT, on=.(id, rn), (wDTcols) := mget(paste0("i.", wDTcols))]
id date snapshot_date x_1 x_5 rn n_x_1_1 n_x_1_2 n_x_1_3 n_x_5_1 n_x_5_2 n_x_5_3
1: 1 2016-01-01 2016-01-02 1 0 1 1 0 0 0 0 0
2: 1 2016-02-01 2016-02-02 0 1 2 0 1 0 1 0 0
3: 1 2016-02-23 2016-02-24 1 1 3 1 1 0 2 0 0
4: 1 2016-03-04 2016-03-05 0 0 4 1 0 1 1 1 0
5: 1 2016-04-01 2016-04-02 0 1 5 0 1 0 1 1 1
6: 2 2016-01-01 2016-01-02 1 1 1 1 0 0 1 0 0
7: 2 2016-02-01 2016-02-02 1 0 2 1 1 0 0 1 0
8: 2 2016-02-23 2016-02-24 0 0 3 1 1 0 0 1 0
9: 2 2016-03-04 2016-03-05 0 0 4 0 1 1 0 0 1
10: 2 2016-04-01 2016-04-02 1 1 5 1 0 1 1 0 0