I have the following dummy data:
library(data.table)
# The .internal.selfref pointer printed by dput() cannot be parsed,
# so it is dropped here and the data.table is rebuilt with setDT().
DT <- setDT(structure(list(id = c(1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 6, 7, 7,
7), policy_num = c(41551662L, 50966414L, 43077202L, 46927463L,
57130236L, 57050065L, 26196559L, 33545119L, 52304024L, 73953064L,
50340507L, 50491162L, 76577511L, 108067534L), product = c("apple",
"apple", "pear", "apple", "apple", "apple", "plum", "apple",
"pear", "apple", "apple", "apple", "pear", "pear"), start_date =
structure(c(13607, 15434, 14276, 15294, 15660, 15660, 10547, 15117, 15483,
16351, 15429, 15421, 16474, 17205), class = "Date"), end_date = structure(c(15068,
16164, 17563, 15660, 15660, 16390, 13834, 16234, 17674, 17447,
15794, 15786, 17205, 17570), class = "Date")), .Names = c("id",
"policy_num", "product", "start_date", "end_date"), row.names = c(NA,
-14L), class = "data.frame"))
id policy_num product start_date end_date
1 41551662 apple 2007-04-04 2011-04-04
1 50966414 apple 2012-04-04 2014-04-04
2 43077202 pear 2009-02-01 2018-02-01
3 46927463 apple 2011-11-16 2012-11-16
3 57130236 apple 2012-11-16 2012-11-16
3 57050065 apple 2012-11-16 2014-11-16
4 26196559 plum 1998-11-17 2007-11-17
5 33545119 apple 2011-05-23 2014-06-13
5 52304024 pear 2012-05-23 2018-05-23
5 73953064 apple 2014-10-08 2017-10-08
6 50340507 apple 2012-03-30 2013-03-30
7 50491162 apple 2012-03-22 2013-03-22
7 76577511 pear 2015-02-08 2017-02-08
7 108067534 pear 2017-02-08 2018-02-08
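Based on this, I would like to calculate the following variables (grouped by user_id, i.e. the id column):
1) Number of currently held products (no_prod_now): the number of distinct products whose end_date > the start_date currently being evaluated. Put simply, the number of products held by the user_id at the time of start_date.
2) Number of currently held active policies (no_policies_now): as above, but applied to policy_num.
3) Number of policies opened within the 3 months prior to the current start_date (policies_open_3mo).
4) policies_closed_3mo: as above, but the number of policies closed within the past 3 months.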
The ideal output would look like this:
id policy_num product start_date end_date no_prod_now no_policies_now policies_closed_3mo policies_open_3mo
1 41551662 apple 2007-04-04 2011-04-04 1 1 0 0
1 50966414 apple 2012-04-04 2014-04-04 1 1 0 0
2 43077202 pear 2009-02-01 2018-02-01 1 1 0 0
3 46927463 apple 2011-11-16 2012-11-16 1 1 0 0
3 57130236 apple 2012-11-16 2012-11-16 1 1 1 0
3 57050065 apple 2012-11-16 2014-11-16 1 1 2 1
4 26196559 plum 1998-11-17 2007-11-17 1 1 0 0
5 33545119 apple 2011-05-23 2014-06-13 1 1 0 0
5 52304024 pear 2012-05-23 2018-05-23 2 2 0 1
5 73953064 apple 2014-10-08 2017-10-08 2 2 0 0
6 50340507 apple 2012-03-30 2013-03-30 1 1 0 0
7 50491162 apple 2012-03-22 2013-03-22 1 1 0 0
7 76577511 pear 2015-02-08 2017-02-08 1 1 0 0
7 108067534 pear 2017-02-08 2018-02-08 1 1 1 0
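Ideally I am looking for a solution implemented in data.table, since I will be applying it to large volumes of data, but base R or dplyr solutions, which I can always convert to data.table, would also be valuable. Thanks!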
Answer 0 (score: 1)
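This is quite tricky but can be solved with a number of non-equi self-joins.
Edit: It has turned out that update on join does not work together with non-equi self-joins as I had expected (see here). So I had to revise the code completely to avoid updates in place.
Instead, the four additional columns are created by three separate non-equi self-joins and are combined for the final result.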
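For readers unfamiliar with the technique, here is a minimal sketch of a non-equi self-join with by = .EACHI. It uses a hypothetical toy table; the names toy, from, to and n_active are illustrative only and do not appear in the question's data. Each row is joined against all rows of the same id whose interval covers that row's start, and the matches are counted per row.

```r
library(data.table)

# Hypothetical toy table: one customer with three intervals
toy <- data.table(
  id   = c(1L, 1L, 1L),
  from = as.Date(c("2020-01-01", "2020-06-01", "2021-03-01")),
  to   = as.Date(c("2020-12-31", "2021-01-31", "2021-12-31"))
)

# In each "on" condition the left-hand name refers to the outer table (x)
# and the right-hand name to the inner table (i). So for every row of i we
# count the rows of x with the same id, x.from <= i.from and x.to > i.from.
active <- toy[toy, on = c("id", "from<=from", "to>from"),
              .(n_active = .N), by = .EACHI]

active$n_active
# 1 2 1
```

The second interval starts while the first is still running, hence the count of 2 for that row. by = .EACHI evaluates the counting expression once per row of the inner table, which is what keeps the result aligned with the original row order.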
library(data.table)
library(lubridate)
result <-
# create helper column for previous three months periods.
# lubridate's month arithmetic avoids NAs at end of month, e.g., February
DT[, start_date_3mo := start_date %m-% period(month = 3L)][
# start "cbind()" with original columns
, c(.SD,
# count number of products and policies held at time of start_date
DT[DT, on = c("id", "start_date<=start_date", "end_date>start_date"),
.(no_prod_now = uniqueN(product), no_pols_now = uniqueN(policy_num)),
by = .EACHI][, c("no_prod_now", "no_pols_now")],
# policies closed within previous 3 months of start_date
DT[DT, on = c("id", "end_date>=start_date_3mo", "end_date<=start_date"),
.(pols_closed_3mo = .N), by = .EACHI][, "pols_closed_3mo"],
# additional policies opened within previous 3 months of start_date
DT[DT, on = c("id", "start_date>=start_date_3mo", "start_date<=start_date"),
.(pols_opened_3mo = .N - 1L), by = .EACHI][, "pols_opened_3mo"])][
# omit helper column
, -"start_date_3mo"]
result
    id policy_num product start_date   end_date no_prod_now no_pols_now pols_closed_3mo pols_opened_3mo
 1:  1   41551662   apple 2007-04-04 2011-04-04           1           1               0               0
 2:  1   50966414   apple 2012-04-04 2014-04-04           1           1               0               0
 3:  2   43077202    pear 2009-02-01 2018-02-01           1           1               0               0
 4:  3   46927463   apple 2011-11-16 2012-11-16           1           1               0               0
 5:  3   57130236   apple 2012-11-16 2012-11-16           1           1               2               1
 6:  3   57050065   apple 2012-11-16 2014-11-16           1           1               2               1
 7:  4   26196559    plum 1998-11-17 2007-11-17           1           1               0               0
 8:  5   33545119   apple 2011-05-23 2014-06-13           1           1               0               0
 9:  5   52304024    pear 2012-05-23 2018-05-23           2           2               0               0
10:  5   73953064   apple 2014-10-08 2017-10-08           2           2               0               0
11:  6   50340507   apple 2012-03-30 2013-03-30           1           1               0               0
12:  7   50491162   apple 2012-03-22 2013-03-22           1           1               0               0
13:  7   76577511    pear 2015-02-08 2017-02-08           1           1               0               0
14:  7  108067534    pear 2017-02-08 2018-02-08           1           1               1               0
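Note that there are discrepancies between the OP's expected result and the result here for the policies opened within the 3 months before start_date. For id == 3, two policies start on 2012-11-16, so one additional policy has to be counted for each of those rows. For id == 5, all start_date values are more than 3 months apart, so there should be no overlap.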
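Both claims can be checked by hand. The following base R sketch copies the relevant start dates from the sample data; the variable names start_id5, start_id3 and gaps_id5 are made up for this illustration.

```r
# Start dates copied from the sample data
start_id5 <- as.Date(c("2011-05-23", "2012-05-23", "2014-10-08"))
start_id3 <- as.Date(c("2011-11-16", "2012-11-16", "2012-11-16"))

# Gaps in days between consecutive policies of id == 5
gaps_id5 <- as.integer(diff(start_id5))
gaps_id5
# 366 868

# Number of policies of id == 3 that start on 2012-11-16
sum(start_id3 == as.Date("2012-11-16"))
# 2
```

Both gaps for id == 5 are far longer than any three-month window, while the two policies of id == 3 that share a start date each see the other one as "opened within the previous 3 months".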
In addition, for the policies closed within the 3 months before start_date, rows 5 and 6 both have the value 2 because id == 3 has two policies ending on 2012-11-16.
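Separately, regarding the helper column start_date_3mo: the comment in the code says that lubridate's month arithmetic avoids NAs at the end of a month. A small sketch of that behaviour follows, using an arbitrary example date that is not taken from the data, and months(3) rather than the period(month = 3L) call used in the answer.

```r
library(lubridate)

d <- as.Date("2012-05-31")

# Plain period subtraction targets the non-existent 2012-02-31 and gives NA
is.na(d - months(3))
# TRUE

# %m-% rolls back to the last existing day of the target month instead
d %m-% months(3)
# "2012-02-29"
```

With plain subtraction, such rows would end up with a missing start_date_3mo, which would presumably leave the join conditions on that column without any matches for them.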