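This question is best shown through an example, and it differs slightly from the one asked here: Applying function to consecutive subvectors of equal size
Say I have price data for some companies, "MMM" and "ABT" for example (the dates of the prices are stored in the rownames of this data frame):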
> a
MMM ABT
1991-01-02 11.01 2.58
1991-01-03 10.83 2.48
1991-01-04 10.80 2.43
1991-01-07 10.67 2.39
1991-01-08 10.39 2.42
1991-01-09 10.18 2.42
1991-01-10 10.33 2.43
1991-01-11 10.59 2.44
1991-01-14 10.60 2.38
1991-01-15 10.54 2.39
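First, I want to split the dates in this data frame into equal intervals of "j" rows each. Let's say j = 2. These are the intervals we want to look at: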
interval1 is from 1991-01-02 to 1991-01-03
interval2 is from 1991-01-04 to 1991-01-07
interval3 is from 1991-01-08 to 1991-01-09
interval4 is from 1991-01-10 to 1991-01-11
interval5 is from 1991-01-14 to 1991-01-15
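I want the last date to be included as an interval end if it is not already there, which is why I use unique() below. So, for intervals of length "j", we can build them like this (there may be a better way to generate the intervals above):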
beg <- rownames(a)[seq(1,nrow(a),2)]
# case for j = 2:
# [1] "1991-01-02" "1991-01-04" "1991-01-08" "1991-01-10" "1991-01-14"
end <- rownames(a)[seq(1,nrow(a),2)+1]
end <- unique(c(end[!is.na(end)],rownames(a)[nrow(a)]))
# case for j = 2:
# [1] "1991-01-03" "1991-01-07" "1991-01-09" "1991-01-11" "1991-01-15"
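One possible alternative for generating the same boundaries, as a minimal sketch that assumes the rows of a are already sorted by date: give every row an interval id with ceiling(seq_along(dates) / j), then take the first and last date of each id. The names dates and interval_id are illustrative and do not appear in the original code.

```r
# Dates copied from the rownames of `a` in the example above
dates <- c("1991-01-02", "1991-01-03", "1991-01-04", "1991-01-07", "1991-01-08",
           "1991-01-09", "1991-01-10", "1991-01-11", "1991-01-14", "1991-01-15")
j <- 2

# Rows 1..j get id 1, rows j+1..2j get id 2, and so on;
# a shorter final interval is kept automatically
interval_id <- ceiling(seq_along(dates) / j)

# First and last date of every interval
beg <- dates[!duplicated(interval_id)]
end <- dates[!duplicated(interval_id, fromLast = TRUE)]
```

With this approach the unique() step is not needed, because the last row always closes the final interval.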
From here, I have another data frame (b) with data like this:
> b
portfolio_return
1991-01-09 0.010524144
1991-01-10 -0.010706638
1991-01-11 -0.015665796
1991-01-14 -0.015151515
1991-01-15 0.055000000
1991-01-16 -0.052173913
1991-01-21 -0.010204082
What I am looking for is the average of the values that fall within each interval. For example:
interval1_values = NA
interval2_values = NA
interval3_values = c(0.010524144)
interval4_values = c(-0.010706638,-0.015665796)
interval5_values = c(-0.015151515, 0.055000000)
#From this we can then easily calculate the average over each interval.
average1 = mean(interval1_values)
average2 = mean(interval2_values)
#etc...
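The per-interval averages described above could be computed without writing out each interval by hand. The following is only a sketch under one assumption that the question does not state: every date of b that lies inside an interval also appears in the rownames of a, so an exact match() is enough. The names a_dates, interval_id and group are illustrative.

```r
# Dates of `a` and the contents of `b`, copied from the example above
a_dates <- c("1991-01-02", "1991-01-03", "1991-01-04", "1991-01-07", "1991-01-08",
             "1991-01-09", "1991-01-10", "1991-01-11", "1991-01-14", "1991-01-15")
b <- data.frame(
  portfolio_return = c(0.010524144, -0.010706638, -0.015665796, -0.015151515,
                       0.055000000, -0.052173913, -0.010204082),
  row.names = c("1991-01-09", "1991-01-10", "1991-01-11", "1991-01-14",
                "1991-01-15", "1991-01-16", "1991-01-21")
)
j <- 2

# Interval id of every row of `a`
interval_id <- ceiling(seq_along(a_dates) / j)

# Look up each date of `b` among the dates of `a`;
# dates that do not occur in `a` get NA and are ignored by tapply()
group <- interval_id[match(rownames(b), a_dates)]

# A factor carrying every interval level keeps empty intervals as NA
group <- factor(group, levels = unique(interval_id))
averages <- tapply(b$portfolio_return, group, mean)
```

For the example data, intervals 1 and 2 come out as NA, and intervals 3 to 5 hold the means of the values listed above.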
My current solution is this:
library(data.table)

averages_interval <- function(a, b, j){
  # first date of every interval of j consecutive rows of a
  beg <- rownames(a)[seq(1, nrow(a), j)]
  # last date of every interval; the final row closes a shorter last interval
  end <- rownames(a)[seq(1, nrow(a), j) + j - 1]
  end <- unique(c(end[!is.na(end)], rownames(a)[nrow(a)]))
  dates <- rownames(b)
  # group[i] is the interval containing dates[i], or NA if no interval does
  group <- rep(NA_integer_, length(dates))
  last <- 1
  # these loops match the dates of b to their proper interval
  # for j = 2 case, it places dates[1] in interval3, dates[2] in interval4, and so on...
  for(i in seq_along(dates)){
    # dates are sorted, so resume the search at the last matched interval
    k <- last
    while(k <= length(end)){
      if(dates[i] >= beg[k] && dates[i] <= end[k]){
        group[i] <- k
        last <- k
        break
      }
      k <- k + 1
    }
  }
  df <- data.frame(b, group = group)
  # drop the dates that fall outside every interval
  df <- df[complete.cases(df), ]
  df <- data.table(df)
  averages <- df[, list(mean = mean(portfolio_return)), by = "group"][[2]]
  return(averages)
}
###### for j = 2
group mean
1: 2 0.01052414
2: 3 0.01318622
3: 4 0.01992424
Is there a more efficient way to solve this problem?
Thanks a lot.
Answer 0 (score: 0)
Below is a solution that uses data.table:
# reading in your data
x <- read.table(text='MMM ABT
1991-01-02 11.01 2.58
1991-01-03 10.83 2.48
1991-01-04 10.80 2.43
1991-01-07 10.67 2.39
1991-01-08 10.39 2.42
1991-01-09 10.18 2.42
1991-01-10 10.33 2.43
1991-01-11 10.59 2.44
1991-01-14 10.60 2.38
1991-01-15 10.54 2.39', header=TRUE, row.names=1)
#
y <- read.table(text='portfolio_return
1991-01-09 0.010524144
1991-01-10 -0.010706638
1991-01-11 -0.015665796
1991-01-14 -0.015151515
1991-01-15 0.055000000
1991-01-16 -0.052173913
1991-01-21 -0.010204082', header=TRUE, row.names=1)
# load required packages
require(data.table)
require(zoo)
# setting to data.table
setDT(x, keep.rownames=TRUE)
setDT(y, keep.rownames=TRUE)
# defining the intervals
# DOUBLE CHECK THIS; I DON'T UNDERSTAND HOW YOU DEFINE THESE
x[, interval := c(1, rep(1:nrow(x), each=2))[1:nrow(x)]]
# merge data
res <- merge(x, y, by='rn', all = TRUE)
# setting the date as key
res[, rn := as.Date(rn)]
setkey(res, 'rn')
# perhaps carry forward last observation?
# THIS MAY NOT BE WHAT YOU WANT... FEEL FREE TO CHANGE
res[, interval := na.locf(interval)]
# calculate means, start and end of interval
res[, list(start = min(rn),
end = max(rn),
mean_return = mean(portfolio_return)), by=interval]
## interval start end mean_return
## 1: 1 1991-01-02 1991-01-04 NA
## 2: 2 1991-01-07 1991-01-08 NA
## 3: 3 1991-01-09 1991-01-10 -0.000091247
## 4: 4 1991-01-11 1991-01-14 -0.015408656
## 5: 5 1991-01-15 1991-01-21 -0.002459332
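If the intervals should instead follow the question's definition (consecutive blocks of j rows, with dates outside every block dropped rather than carried forward), base R's findInterval() is one possibility. This is a minimal sketch under that assumption and not part of the original answer; the helper name interval_means and its arguments are hypothetical.

```r
# Hypothetical helper: mean of `values` per block of `j` consecutive `a_dates`.
# `a_dates` must be sorted; `b_dates` outside every block are ignored.
interval_means <- function(a_dates, b_dates, values, j) {
  a_dates <- as.Date(a_dates)
  b_dates <- as.Date(b_dates)
  starts <- seq(1, length(a_dates), by = j)
  beg <- a_dates[starts]
  end <- a_dates[pmin(starts + j - 1, length(a_dates))]
  # Index of the last interval whose start is <= each date (0 if there is none)
  idx <- findInterval(as.numeric(b_dates), as.numeric(beg))
  idx[idx == 0] <- NA
  # Discard dates that lie after the end of their candidate interval
  idx[!is.na(idx) & b_dates > end[idx]] <- NA
  # Empty intervals stay in the result as NA
  tapply(values, factor(idx, levels = seq_along(beg)), mean)
}

# Example data copied from the question
a_dates <- c("1991-01-02", "1991-01-03", "1991-01-04", "1991-01-07", "1991-01-08",
             "1991-01-09", "1991-01-10", "1991-01-11", "1991-01-14", "1991-01-15")
b_dates <- c("1991-01-09", "1991-01-10", "1991-01-11", "1991-01-14",
             "1991-01-15", "1991-01-16", "1991-01-21")
returns <- c(0.010524144, -0.010706638, -0.015665796, -0.015151515,
             0.055000000, -0.052173913, -0.010204082)

res <- interval_means(a_dates, b_dates, returns, j = 2)
```

Unlike the merge with na.locf above, this version ignores 1991-01-16 and 1991-01-21, because they fall after the last interval ends.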