我有一个包含contractID,data和DaysPastDue信息的数据集。我如何期待每行12个月,并确定与该合同ID相对应的最大DPD。 数据集如下所示
Contract_number Date DPD
1: a 2014-03-01 14
2: a 2014-03-01 5
3: a 2014-10-01 6
4: a 2014-10-01 16
5: a 2015-12-01 4
6: a 2015-12-01 17
7: a 2016-09-01 16
8: a 2016-09-01 15
9: a 2016-10-01 3
10: a 2016-10-01 8
11: b 2014-05-01 18
12: b 2014-05-01 9
13: b 2014-08-01 2
14: b 2014-08-01 14
生成此数据集的代码
library(data.table)
set.seed(123)
dummy_data= data.table(Contract_number = letters[1:4],
Date = sample(seq(as.Date('2014/01/01'), as.Date('2016/12/01'), by="month"), 20),
DPD=sample.int(20:50, 40, replace = TRUE)
)
dummy_data[order(Contract_number,Date)]
我有一个dplyr解决方案,想知道是否有更简洁的数据表方式来做到这一点?
max_dpd_data<-dummy_data %>% left_join(dummy_data,dummy_data,by="Contract_number") %>%
filter(Date.y>Date.x & Date.y<=(Date.x + months(12))) %>%
group_by(Contract_number, Date.x) %>% summarise(Max_DPD_12_M = max(DPD.y), N_Mnths_Future=n()) %>%
rename(Date='Date.x')
dummy_data1<- left_join(dummy_data,max_dpd_data,by = c("Contract_number","Date"))
我也不想使用expand.grid来填补缺失的月份,然后使用Shift。
答案 0 :(得分:1)
我认为你正在寻找类似的东西
clients.SupportsFiltering.ToString()
我不确定您是否想在12个月内观察月份。此外,> library(data.table)
> set.seed(123)
> dummy_data= data.table(Contract_number = letters[1:4],
+ Date = sample(seq(as.Date('2014/01/01'), as.Date('2016/12/01'), by="month"), 20),
+ DPD=sample.int(20:50, 40, replace = TRUE)
+ )
>
> # You can useset key to sort
> setkey(dummy_data, Contract_number, Date)
> dummy_data
Contract_number Date DPD
1: a 2014-05-01 16
2: a 2014-05-01 3
3: a 2014-11-01 18
4: a 2014-11-01 3
5: a 2015-05-01 14
6: a 2015-05-01 16
7: a 2016-07-01 14
8: a 2016-07-01 4
9: a 2016-09-01 6
10: a 2016-09-01 6
11: b 2014-01-01 5
12: b 2014-01-01 16
13: b 2014-02-01 15
14: b 2014-02-01 3
15: b 2015-01-01 3
16: b 2015-01-01 18
17: b 2016-04-01 14
18: b 2016-04-01 9
19: b 2016-10-01 16
20: b 2016-10-01 3
21: c 2014-03-01 1
22: c 2014-03-01 12
23: c 2014-06-01 7
24: c 2014-06-01 18
25: c 2015-02-01 13
26: c 2015-02-01 9
27: c 2015-04-01 11
28: c 2015-04-01 5
29: c 2016-01-01 20
30: c 2016-01-01 1
31: d 2014-12-01 19
32: d 2014-12-01 9
33: d 2015-07-01 10
34: d 2015-07-01 5
35: d 2015-12-01 5
36: d 2015-12-01 8
37: d 2016-02-01 12
38: d 2016-02-01 10
39: d 2016-06-01 20
40: d 2016-06-01 8
Contract_number Date DPD
>
> # Add yearmonth decimal column
> dummy_data[, ym := as.integer(format(Date, "%Y%m"))][
+ , ym := (ym %/% 100) + (ym %% 100) / 12][, ym_less_one := ym - 1][
+ , ym2 := ym]
>
> dummy_data <- dummy_data[
+ dummy_data, on = c("Contract_number", "ym>ym", "ym_less_one<=ym2"),
+ .(Date = first(i.Date), DPD = first(i.DPD), max_DPD = max(DPD)),
+ by =.EACHI][, c("ym", "ym_less_one") := NULL]
>
> print(dummy_data)
Contract_number Date DPD max_DPD
1: a 2014-05-01 16 18
2: a 2014-05-01 3 18
3: a 2014-11-01 18 16
4: a 2014-11-01 3 16
5: a 2015-05-01 14 NA
6: a 2015-05-01 16 NA
7: a 2016-07-01 14 6
8: a 2016-07-01 4 6
9: a 2016-09-01 6 NA
10: a 2016-09-01 6 NA
11: b 2014-01-01 5 18
12: b 2014-01-01 16 18
13: b 2014-02-01 15 18
14: b 2014-02-01 3 18
15: b 2015-01-01 3 NA
16: b 2015-01-01 18 NA
17: b 2016-04-01 14 16
18: b 2016-04-01 9 16
19: b 2016-10-01 16 NA
20: b 2016-10-01 3 NA
21: c 2014-03-01 1 18
22: c 2014-03-01 12 18
23: c 2014-06-01 7 13
24: c 2014-06-01 18 13
25: c 2015-02-01 13 20
26: c 2015-02-01 9 20
27: c 2015-04-01 11 20
28: c 2015-04-01 5 20
29: c 2016-01-01 20 NA
30: c 2016-01-01 1 NA
31: d 2014-12-01 19 10
32: d 2014-12-01 9 10
33: d 2015-07-01 10 20
34: d 2015-07-01 5 20
35: d 2015-12-01 5 20
36: d 2015-12-01 8 20
37: d 2016-02-01 12 20
38: d 2016-02-01 10 20
39: d 2016-06-01 20 NA
40: d 2016-06-01 8 NA
Contract_number Date DPD max_DPD
操作+浮点问题可能存在一些问题。解决方案可能是从>=
列中减去some_factor * .Machine$double.eps
。