data.table + R: using two other date columns

Time: 2016-02-17 01:44:03

Tags: r dataframe data.table subset mean

I have a data.frame DF with the 4 columns given below.

DF <- structure(list(Ticker = c("ABC", "ABC", "ABC", "ABC","ABC","ABC","ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC","ABC","XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ","XYZ", "XYZ", "XYZ", "XYZ"), `Enter Date` = c("2005-02-08", "2005-02-23","2005-06-07", "2005-06-08", "2005-08-16", "2005-09-07", "2005-11-15","2005-11-17", "2005-12-06", "2005-12-23", "2006-02-09", "2006-02-10","2006-02-15", "2006-02-22", "2006-05-01", "2005-02-22", "2005-02-28","2005-03-01", "2005-03-03", "2005-03-04", "2005-03-11", "2005-03-15","2005-04-04", "2005-04-05", "2005-04-15", "2005-04-22", "2005-04-28"), `Exit Date` = c("2005-03-09", "2005-03-23", "2005-07-06","2005-07-07", "2005-09-14", "2005-10-05", "2005-12-14", "2005-12-16","2006-01-05", "2006-01-25", "2006-03-10", "2006-03-13", "2006-03-16","2006-03-22", "2006-05-30", "2005-03-22", "2005-03-29", "2005-03-30","2005-04-01", "2005-04-04", "2005-04-11", "2005-04-13", "2005-05-02","2005-05-03", "2005-05-13", "2005-05-20","2005-05-26"), Return = c(4.669,4.034, 3.796, -4.059, -11.168, -0.496,-3.597, 3.45, -4.428,1.914, 3.577, 4, 8.451, 5.521, 10.324, 3.104, 0.787,-3.407,-1.441, -4.157, 4.343, 2.827, 0.425, -1.37, -3.175, -11.027,8.144)), .Names = c("Ticker", "Enter Date", "Exit Date", "Return"), row.names = c(NA, 27L), class = "data.frame")

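I want to calculate the cumulative average of the Return column over the rows where Enter Date > Exit Date, for each unique Enter Date and each Ticker. I can do it the data.frame way in two steps. The code I use is: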

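# For each row t, average `yvar` over the rows whose Exit Date is earlier than row t's Enter Date.
# The first nSkip rows are set to NA.
calCumAve <- function(data, yvar, nSkip)
{
  nrs <- seq_len(nrow(data))
  CumAve <- c(rep(NA, nSkip), sapply(nrs[nrs > nSkip],
    FUN = function(t) { mean(data[data$"Enter Date"[t] > data$"Exit Date", yvar]) }))
  return(CumAve)
}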

DFOut <-  do.call(rbind,lapply(sort(unique(DF$Ticker)), FUN=function(s){
                     sd <- DF[DF$Ticker==s,]
                     sd$AvgRet <- calCumAve(data=sd,yvar="Return",nSkip=4)
                     return(sd)}))

The desired output is DFOut.

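I want to do this the data.table way. The main problem I face when applying it in data.table is using the two date columns to subset the Return column. A few things to consider: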

(1) In practice there will be 1000 tickers (only 2 in this example, ABC and XYZ) and more than 10 years of daily data.

(2) Do the operation without specifying nSkip. For Enter Date <= Exit Date it should give NA (not NaN, as in rows 20:22 of DFOut).

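(3) If possible, use column names when subsetting the data.table. The example given has four columns, but the working data.table will have more than 25 columns, and I need to apply the same calculation to multiple columns by changing yvar.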
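The NaN mentioned in point (2) comes from how base R's `mean` treats an empty selection: `mean(numeric(0))` is `NaN`, not `NA`. A minimal sketch of a guard, where `safe_mean` is a hypothetical helper name and not part of the code above:

```r
# Hypothetical helper: return NA (rather than NaN) when nothing is selected.
safe_mean <- function(x) {
  if (length(x) == 0L) NA_real_ else mean(x)
}

empty <- numeric(0)

is.nan(mean(empty))         # TRUE: base mean() of an empty vector is NaN
is.nan(safe_mean(empty))    # FALSE
is.na(safe_mean(empty))     # TRUE
safe_mean(c(4.669, 4.034))  # 4.3515
```

Swapping such a guard in for `mean` inside `calCumAve` would be one way to get NA for rows with no earlier Exit Date, though it does not remove the need for `nSkip` by itself.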

Any help is much appreciated. Thanks in advance.

1 Answer:

Answer 0 (score: 2)

dt = as.data.table(DF) # or setDT to convert in place

# cumulative mean, but without the date restriction
dt[, rawAvgRets := cumsum(Return) / (1:.N), by = Ticker]

# find the latest matching date using a rolling merge (assumes sorted dates)
# if you run into > vs >= issues, adjust enter or exit date by a day
dt[, avgRets := dt[dt, rawAvgRets, roll = TRUE,
                   on = c('Ticker' = 'Ticker', 'Exit Date' = 'Enter Date')]]
#    Ticker Enter Date  Exit Date  Return rawAvgRets    avgRets
# 1:    ABC 2005-02-08 2005-03-09   4.669  4.6690000         NA
# 2:    ABC 2005-02-23 2005-03-23   4.034  4.3515000         NA
# 3:    ABC 2005-06-07 2005-07-06   3.796  4.1663333  4.3515000
# 4:    ABC 2005-06-08 2005-07-07  -4.059  2.1100000  4.3515000
# 5:    ABC 2005-08-16 2005-09-14 -11.168 -0.5456000  2.1100000
# 6:    ABC 2005-09-07 2005-10-05  -0.496 -0.5373333  2.1100000
# 7:    ABC 2005-11-15 2005-12-14  -3.597 -0.9744286 -0.5373333
# 8:    ABC 2005-11-17 2005-12-16   3.450 -0.4213750 -0.5373333
# 9:    ABC 2005-12-06 2006-01-05  -4.428 -0.8665556 -0.5373333
#10:    ABC 2005-12-23 2006-01-25   1.914 -0.5885000 -0.4213750
#11:    ABC 2006-02-09 2006-03-10   3.577 -0.2098182 -0.5885000
#12:    ABC 2006-02-10 2006-03-13   4.000  0.1410000 -0.5885000
#13:    ABC 2006-02-15 2006-03-16   8.451  0.7802308 -0.5885000
#14:    ABC 2006-02-22 2006-03-22   5.521  1.1188571 -0.5885000
#15:    ABC 2006-05-01 2006-05-30  10.324  1.7325333  1.1188571
#16:    XYZ 2005-02-22 2005-03-22   3.104  3.1040000         NA
#17:    XYZ 2005-02-28 2005-03-29   0.787  1.9455000         NA
#18:    XYZ 2005-03-01 2005-03-30  -3.407  0.1613333         NA
#19:    XYZ 2005-03-03 2005-04-01  -1.441 -0.2392500         NA
#20:    XYZ 2005-03-04 2005-04-04  -4.157 -1.0228000         NA
#21:    XYZ 2005-03-11 2005-04-11   4.343 -0.1285000         NA
#22:    XYZ 2005-03-15 2005-04-13   2.827  0.2937143         NA
#23:    XYZ 2005-04-04 2005-05-02   0.425  0.3101250 -1.0228000
#24:    XYZ 2005-04-05 2005-05-03  -1.370  0.1234444 -1.0228000
#25:    XYZ 2005-04-15 2005-05-13  -3.175 -0.2064000  0.2937143
#26:    XYZ 2005-04-22 2005-05-20 -11.027 -1.1900909  0.2937143
#27:    XYZ 2005-04-28 2005-05-26   8.144 -0.4122500  0.2937143
#    Ticker Enter Date  Exit Date  Return rawAvgRets    avgRets
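The rolling join above also matches an Exit Date equal to the Enter Date (see row 23), which is the `>` vs `>=` issue its comment mentions. As an alternative sketch, and not the method used in this answer, a non-equi join (data.table 1.9.8 or later) can express the strict `Exit Date < Enter Date` condition directly, returns NA for rows with no match, and takes the column by name. The function name `cum_avg_before_entry` and the toy table are hypothetical; the dates are converted to `IDate` because the sketch assumes they arrive as character strings.

```r
library(data.table)

# Toy data with made-up values (not the DF from the question).
toy <- data.table(
  Ticker       = c("A", "A", "A", "B", "B"),
  `Enter Date` = c("2005-01-01", "2005-01-10", "2005-02-01", "2005-01-01", "2005-03-01"),
  `Exit Date`  = c("2005-01-05", "2005-01-20", "2005-02-10", "2005-01-15", "2005-03-20"),
  Return       = c(1, 3, 5, 10, 20)
)

# Hypothetical helper: for each row, the mean of column `yvar` over rows of the
# same Ticker whose Exit Date is strictly earlier than this row's Enter Date.
cum_avg_before_entry <- function(dt, yvar) {
  # x holds the values to average; i holds one lookup row per original row.
  x <- data.table(Ticker = dt$Ticker,
                  exit   = as.IDate(dt$`Exit Date`),
                  y      = dt[[yvar]])
  i <- data.table(Ticker = dt$Ticker,
                  enter  = as.IDate(dt$`Enter Date`))
  # Non-equi join: same Ticker and x.exit < i.enter, one group per row of i.
  # Rows of i with no match see y as NA, so their mean is NA rather than NaN.
  x[i, on = .(Ticker, exit < enter), .(avg = mean(y)), by = .EACHI]$avg
}

toy[, AvgRet := cum_avg_before_entry(toy, "Return")]
toy$AvgRet
# NA  1  2 NA 10
```

To cover several columns, the helper could be called once per name in a vector of column names. Because it averages every matched set from scratch, it may be slower on large data than the cumulative-sum-plus-rolling-join approach above.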