如何使用data.table查找特定日期范围内的列的最大值

时间:2017-10-17 10:57:42

标签: r data.table

我有一个包含contractID,data和DaysPastDue信息的数据集。我如何期待每行12个月,并确定与该合同ID相对应的最大DPD。 数据集如下所示

 Contract_number       Date    DPD
 1:               a 2014-03-01  14
 2:               a 2014-03-01   5
 3:               a 2014-10-01   6
 4:               a 2014-10-01  16
 5:               a 2015-12-01   4
 6:               a 2015-12-01  17
 7:               a 2016-09-01  16
 8:               a 2016-09-01  15
 9:               a 2016-10-01   3
10:               a 2016-10-01   8
11:               b 2014-05-01  18
12:               b 2014-05-01   9
13:               b 2014-08-01   2
14:               b 2014-08-01  14

生成此数据集的代码

library(data.table)
set.seed(123)
dummy_data= data.table(Contract_number = letters[1:4], 
                       Date = sample(seq(as.Date('2014/01/01'), as.Date('2016/12/01'), by="month"), 20),
                       DPD=sample.int(20:50, 40, replace = TRUE)
)
dummy_data[order(Contract_number,Date)]

我有一个dplyr解决方案,想知道是否有更简洁的数据表方式来做到这一点?

max_dpd_data<-dummy_data %>% left_join(dummy_data,dummy_data,by="Contract_number") %>% 
  filter(Date.y>Date.x & Date.y<=(Date.x + months(12))) %>%
  group_by(Contract_number, Date.x) %>% summarise(Max_DPD_12_M = max(DPD.y), N_Mnths_Future=n()) %>% 
  rename(Date='Date.x')
dummy_data1<- left_join(dummy_data,max_dpd_data,by = c("Contract_number","Date"))

我也不想使用expand.grid来填补缺失的月份,然后使用Shift。

1 个答案:

答案 0 :(得分:1)

我认为你正在寻找类似的东西

clients.SupportsFiltering.ToString()

我不确定您是否想在12个月内观察月份。此外,> library(data.table) > set.seed(123) > dummy_data= data.table(Contract_number = letters[1:4], + Date = sample(seq(as.Date('2014/01/01'), as.Date('2016/12/01'), by="month"), 20), + DPD=sample.int(20:50, 40, replace = TRUE) + ) > > # You can useset key to sort > setkey(dummy_data, Contract_number, Date) > dummy_data Contract_number Date DPD 1: a 2014-05-01 16 2: a 2014-05-01 3 3: a 2014-11-01 18 4: a 2014-11-01 3 5: a 2015-05-01 14 6: a 2015-05-01 16 7: a 2016-07-01 14 8: a 2016-07-01 4 9: a 2016-09-01 6 10: a 2016-09-01 6 11: b 2014-01-01 5 12: b 2014-01-01 16 13: b 2014-02-01 15 14: b 2014-02-01 3 15: b 2015-01-01 3 16: b 2015-01-01 18 17: b 2016-04-01 14 18: b 2016-04-01 9 19: b 2016-10-01 16 20: b 2016-10-01 3 21: c 2014-03-01 1 22: c 2014-03-01 12 23: c 2014-06-01 7 24: c 2014-06-01 18 25: c 2015-02-01 13 26: c 2015-02-01 9 27: c 2015-04-01 11 28: c 2015-04-01 5 29: c 2016-01-01 20 30: c 2016-01-01 1 31: d 2014-12-01 19 32: d 2014-12-01 9 33: d 2015-07-01 10 34: d 2015-07-01 5 35: d 2015-12-01 5 36: d 2015-12-01 8 37: d 2016-02-01 12 38: d 2016-02-01 10 39: d 2016-06-01 20 40: d 2016-06-01 8 Contract_number Date DPD > > # Add yearmonth decimal column > dummy_data[, ym := as.integer(format(Date, "%Y%m"))][ + , ym := (ym %/% 100) + (ym %% 100) / 12][, ym_less_one := ym - 1][ + , ym2 := ym] > > dummy_data <- dummy_data[ + dummy_data, on = c("Contract_number", "ym>ym", "ym_less_one<=ym2"), + .(Date = first(i.Date), DPD = first(i.DPD), max_DPD = max(DPD)), + by =.EACHI][, c("ym", "ym_less_one") := NULL] > > print(dummy_data) Contract_number Date DPD max_DPD 1: a 2014-05-01 16 18 2: a 2014-05-01 3 18 3: a 2014-11-01 18 16 4: a 2014-11-01 3 16 5: a 2015-05-01 14 NA 6: a 2015-05-01 16 NA 7: a 2016-07-01 14 6 8: a 2016-07-01 4 6 9: a 2016-09-01 6 NA 10: a 2016-09-01 6 NA 11: b 2014-01-01 5 18 12: b 2014-01-01 16 18 13: b 2014-02-01 15 18 14: b 2014-02-01 3 18 15: b 2015-01-01 3 NA 16: b 2015-01-01 18 NA 17: b 2016-04-01 14 16 18: b 2016-04-01 9 16 19: b 2016-10-01 16 NA 20: b 2016-10-01 3 NA 21: c 2014-03-01 1 18 22: c 2014-03-01 12 18 23: c 2014-06-01 7 13 24: c 2014-06-01 18 13 25: c 2015-02-01 13 20 26: c 2015-02-01 9 20 27: c 2015-04-01 11 20 28: c 2015-04-01 5 20 29: c 2016-01-01 20 NA 30: c 2016-01-01 1 NA 31: d 2014-12-01 19 10 32: d 2014-12-01 9 10 33: d 2015-07-01 10 20 34: d 2015-07-01 5 20 35: d 2015-12-01 5 20 36: d 2015-12-01 8 20 37: d 2016-02-01 12 20 38: d 2016-02-01 10 20 39: d 2016-06-01 20 NA 40: d 2016-06-01 8 NA Contract_number Date DPD max_DPD 操作+浮点问题可能存在一些问题。解决方案可能是从>=列中减去some_factor * .Machine$double.eps