Count the number of distinct instances occurring within a given time period

Time: 2017-08-18 11:29:23

Tags: r date dplyr data.table time-series

I have the following dummy data:

DT <- data.table::setDT(structure(list(id = c(1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 6, 7, 7, 
7), policy_num = c(41551662L, 50966414L, 43077202L, 46927463L, 
57130236L, 57050065L, 26196559L, 33545119L, 52304024L, 73953064L, 
50340507L, 50491162L, 76577511L, 108067534L), product = c("apple", 
"apple", "pear", "apple", "apple", "apple", "plum", "apple", 
"pear", "apple", "apple", "apple", "pear", "pear"), start_date = 
structure(c(13607, 15434, 14276, 15294, 15660, 15660, 10547, 15117, 15483, 
16351, 15429, 15421, 16474, 17205), class = "Date"), end_date = structure(c(15068, 
16164, 17563, 15660, 15660, 16390, 13834, 16234, 17674, 17447, 
15794, 15786, 17205, 17570), class = "Date")), .Names = c("id", 
"policy_num", "product", "start_date", "end_date"), row.names = c(NA, 
-14L), class = "data.frame"))


id policy_num product start_date   end_date
 1   41551662   apple 2007-04-04 2011-04-04
 1   50966414   apple 2012-04-04 2014-04-04
 2   43077202    pear 2009-02-01 2018-02-01
 3   46927463   apple 2011-11-16 2012-11-16
 3   57130236   apple 2012-11-16 2012-11-16
 3   57050065   apple 2012-11-16 2014-11-16
 4   26196559    plum 1998-11-17 2007-11-17
 5   33545119   apple 2011-05-23 2014-06-13
 5   52304024    pear 2012-05-23 2018-05-23
 5   73953064   apple 2014-10-08 2017-10-08
 6   50340507   apple 2012-03-30 2013-03-30
 7   50491162   apple 2012-03-22 2013-03-22
 7   76577511    pear 2015-02-08 2017-02-08
 7  108067534    pear 2017-02-08 2018-02-08

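Based on this, I would like to calculate the following variables (grouped by user_id, i.e. the id column):

1) Number of currently held products (no_prod_now) - the number of distinct products whose end_date > the start_date currently being evaluated. Put simply, the number of products held by the user_id at the time of start_date

2) Number of currently held active policies (no_policies_now) - as above, but applied to policy_num

3) Number of policies opened within the 3 months prior to the current start_date (policies_open_3mo)

4) policies_closed_3mo - as above, but the number of policies closed within the last 3 months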

The ideal output would look like this:

 id policy_num product start_date   end_date no_prod_now no_policies_now policies_closed_3mo policies_open_3mo
  1   41551662   apple 2007-04-04 2011-04-04           1               1                   0                 0
  1   50966414   apple 2012-04-04 2014-04-04           1               1                   0                 0
  2   43077202    pear 2009-02-01 2018-02-01           1               1                   0                 0
  3   46927463   apple 2011-11-16 2012-11-16           1               1                   0                 0
  3   57130236   apple 2012-11-16 2012-11-16           1               1                   1                 0
  3   57050065   apple 2012-11-16 2014-11-16           1               1                   2                 1
  4   26196559    plum 1998-11-17 2007-11-17           1               1                   0                 0
  5   33545119   apple 2011-05-23 2014-06-13           1               1                   0                 0
  5   52304024    pear 2012-05-23 2018-05-23           2               2                   0                 1
  5   73953064   apple 2014-10-08 2017-10-08           2               2                   0                 0
  6   50340507   apple 2012-03-30 2013-03-30           1               1                   0                 0
  7   50491162   apple 2012-03-22 2013-03-22           1               1                   0                 0
  7   76577511    pear 2015-02-08 2017-02-08           1               1                   0                 0
  7  108067534    pear 2017-02-08 2018-02-08           1               1                   1                 0

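Ideally I am looking for a solution implemented in data.table, because I will apply it to a large volume of data. A base R or dplyr solution would also be valuable, since I can always convert it to data.table. Thanks!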

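1 answer:

Answer 0: (score: 1)

This is quite tricky, but it can be solved with a few non-equi self-joins.

Edit: It turned out that update on join does not work together with non-equi self-joins the way I had expected (see here). So I had to revise the code completely to avoid updating in place.

Instead, the four additional columns are created by three separate non-equi self-joins and are then combined into the final result.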

library(data.table)
library(lubridate)

result <- 
  # create helper column for previous three months periods.
  # lubridate's month arithmetic avoids NAs at end of month, e.g., February
  DT[, start_date_3mo := start_date %m-% period(month = 3L)][
    # start "cbind()" with original columns
  , c(.SD, 
      # count number of products and policies held at time of start_date 
      DT[DT, on = c("id", "start_date<=start_date", "end_date>start_date"), 
         .(no_prod_now = uniqueN(product), no_pols_now = uniqueN(policy_num)), 
         by = .EACHI][, c("no_prod_now", "no_pols_now")],
      # policies closed within previous 3 months of start_date
      DT[DT, on = c("id", "end_date>=start_date_3mo", "end_date<=start_date"), 
         .(pols_closed_3mo = .N), by = .EACHI][, "pols_closed_3mo"],
      # additional policies opened within previous 3 months of start_date
      DT[DT, on = c("id", "start_date>=start_date_3mo", "start_date<=start_date"), 
         .(pols_opened_3mo = .N - 1L), by = .EACHI][, "pols_opened_3mo"])][
           # omit helper column
           , -"start_date_3mo"]
result
    id policy_num product start_date   end_date no_prod_now no_pols_now pols_closed_3mo pols_opened_3mo
 1:  1   41551662   apple 2007-04-04 2011-04-04           1           1               0               0
 2:  1   50966414   apple 2012-04-04 2014-04-04           1           1               0               0
 3:  2   43077202    pear 2009-02-01 2018-02-01           1           1               0               0
 4:  3   46927463   apple 2011-11-16 2012-11-16           1           1               0               0
 5:  3   57130236   apple 2012-11-16 2012-11-16           1           1               2               1
 6:  3   57050065   apple 2012-11-16 2014-11-16           1           1               2               1
 7:  4   26196559    plum 1998-11-17 2007-11-17           1           1               0               0
 8:  5   33545119   apple 2011-05-23 2014-06-13           1           1               0               0
 9:  5   52304024    pear 2012-05-23 2018-05-23           2           2               0               0
10:  5   73953064   apple 2014-10-08 2017-10-08           2           2               0               0
11:  6   50340507   apple 2012-03-30 2013-03-30           1           1               0               0
12:  7   50491162   apple 2012-03-22 2013-03-22           1           1               0               0
13:  7   76577511    pear 2015-02-08 2017-02-08           1           1               0               0
14:  7  108067534    pear 2017-02-08 2018-02-08           1           1               1               0
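
The comment on the helper column mentions lubridate's month arithmetic at the end of a month. As a small illustration of that point (the date is made up for this example, and `months(3)` is used here as a stand-in for the answer's `period(month = 3L)`), plain period subtraction gives NA when the target day does not exist, while `%m-%` rolls back to the last day of the target month:

```r
library(lubridate)

# Hypothetical date chosen so that "three months earlier" would be 2017-02-31
end_of_may <- as.Date("2017-05-31")

# Plain period arithmetic: the non-existent date becomes NA
naive <- end_of_may - months(3)

# %m-% rolls back to the last valid day of the target month
rolled <- end_of_may %m-% months(3)

is.na(naive)
rolled
```

With `%m-%` the helper column `start_date_3mo` is always a valid date, so the non-equi joins never have to compare against NA.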

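Note that the OP's expected result and the result here differ for the policies opened within the 3 months before start_date. For id == 3, two policies start on 2012-11-16, so each of those two rows counts one additional policy. For id == 5, all start_date values are more than 3 months apart, so there should be no overlap.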
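
To see the id == 3 case in isolation, here is a minimal sketch that applies the same join condition to just those three rows. The table name `d3` is made up for this example, and `months(3)` is assumed to be equivalent to the `period(month = 3L)` used above:

```r
library(data.table)
library(lubridate)

# The three id == 3 rows from the question (d3 is a hypothetical name)
d3 <- data.table(
  id         = 3,
  policy_num = c(46927463L, 57130236L, 57050065L),
  start_date = as.Date(c("2011-11-16", "2012-11-16", "2012-11-16")),
  end_date   = as.Date(c("2012-11-16", "2012-11-16", "2014-11-16"))
)
d3[, start_date_3mo := start_date %m-% months(3)]

# For every row, count the rows of the same id whose start_date falls in
# [start_date_3mo, start_date]; subtract 1 so a row does not count itself
opened <- d3[d3, on = c("id", "start_date>=start_date_3mo", "start_date<=start_date"),
             .(pols_opened_3mo = .N - 1L), by = .EACHI]$pols_opened_3mo
opened
```

The two policies that start on the same day each see the other one inside their window, which is why both rows get 1 rather than only the later row.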

Also, for the policies closed within the 3 months before start_date, rows 5 and 6 both have the value 2, because id == 3 has two policies ending on 2012-11-16.
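
As a cross-check, the same four counts can be recomputed row by row without any joins. This is only a brute-force sketch: it is quadratic in the number of rows and therefore not suitable for large data. The names `pol`, `window_start` and `check` are made up for this example, and it applies the answer's join conditions, not the OP's expected values:

```r
library(lubridate)

# The question's data, re-entered as a plain data.frame
pol <- data.frame(
  id         = c(1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 6, 7, 7, 7),
  policy_num = c(41551662, 50966414, 43077202, 46927463, 57130236, 57050065, 26196559,
                 33545119, 52304024, 73953064, 50340507, 50491162, 76577511, 108067534),
  product    = c("apple", "apple", "pear", "apple", "apple", "apple", "plum",
                 "apple", "pear", "apple", "apple", "apple", "pear", "pear"),
  start_date = as.Date(c("2007-04-04", "2012-04-04", "2009-02-01", "2011-11-16",
                         "2012-11-16", "2012-11-16", "1998-11-17", "2011-05-23",
                         "2012-05-23", "2014-10-08", "2012-03-30", "2012-03-22",
                         "2015-02-08", "2017-02-08")),
  end_date   = as.Date(c("2011-04-04", "2014-04-04", "2018-02-01", "2012-11-16",
                         "2012-11-16", "2014-11-16", "2007-11-17", "2014-06-13",
                         "2018-05-23", "2017-10-08", "2013-03-30", "2013-03-22",
                         "2017-02-08", "2018-02-08")),
  stringsAsFactors = FALSE
)

# Start of the 3-month look-back window for every row
window_start <- pol$start_date %m-% months(3)

# For each row i, compare against all rows of the same id
check <- t(sapply(seq_len(nrow(pol)), function(i) {
  same_id <- pol$id == pol$id[i]
  s <- pol$start_date[i]
  held   <- same_id & pol$start_date <= s & pol$end_date > s
  closed <- same_id & pol$end_date >= window_start[i] & pol$end_date <= s
  opened <- same_id & pol$start_date >= window_start[i] & pol$start_date <= s
  c(no_prod_now     = length(unique(pol$product[held])),
    no_pols_now     = length(unique(pol$policy_num[held])),
    pols_closed_3mo = sum(closed),
    pols_opened_3mo = sum(opened) - 1L)  # do not count the row itself
}))
check
```

Each column of `check` should agree with the corresponding column of `result` above, which makes it easy to spot a wrong inequality in one of the `on` clauses.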