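I have working code that computes a running drawdown.duration, where drawdown.duration is defined as the number of months between the current month and the previous peak. However, I implemented it as a for loop, and it runs very slowly.
Is there a more efficient/faster way to do this in R?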
The code uses a data.frame named returnsWithValues (specifically a tibble, since I have been using dplyr). Its dput() output and printed form are:
returnsWithValues <- structure(list(date = structure(c(789, 820, 850, 881, 911, 942
), class = "Date"), value = c(0.94031052, 0.930751624153046,
0.926756311376762, 0.874209664097166, 0.843026010916249, 2.1),
peak = c(1, 1, 1, 1, 1, 2.1), drawdown = c(-0.05968948, -0.0692483758469535,
-0.0732436886232377, -0.125790335902834, -0.156973989083751,
0)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L))
# A tibble: 6 x 4
  date       value  peak drawdown
  <date>     <dbl> <dbl>    <dbl>
1 1972-02-29 0.940   1    -0.0597
2 1972-03-31 0.931   1    -0.0692
3 1972-04-30 0.927   1    -0.0732
4 1972-05-31 0.874   1    -0.126
5 1972-06-30 0.843   1    -0.157
6 1972-07-31 2.1     2.1   0
I implemented drawdown.duration with the following for loop, which gives the correct answer:
library(dplyr)

# add drawdown.duration col
returnsWithValues <- returnsWithValues %>% mutate(drawdown.duration = NA)
for (row in 1:nrow(returnsWithValues)) {
  if (returnsWithValues[row, "value"] == returnsWithValues[row, "peak"]) {
    # at a peak: the duration resets to 0
    returnsWithValues[row, "drawdown.duration"] = 0
  } else {
    if (row == 1) {
      returnsWithValues[row, "drawdown.duration"] = 1
    } else {
      # below the peak: one month longer than the previous row
      returnsWithValues[row, "drawdown.duration"] = returnsWithValues[row - 1, "drawdown.duration"] + 1
    }
  }
}
Answer 0 (score: 2)
I would drop the for loop, as requested, and use indexing instead.
indices <- function(returnsWithValues) {
  # Logical vector: TRUE where value equals peak, FALSE otherwise.
  indices_logical <- returnsWithValues[["value"]] == returnsWithValues[["peak"]]
  indices_to_zero <- which(indices_logical)   # rows at a peak
  indices_drawdown <- which(!indices_logical) # rows in a drawdown
  returnsWithValues[indices_to_zero, "drawdown.duration"] <- 0
  # Basically what you compute, if I understand correctly. The drawdown rows
  # are numbered 1, 2, 3, ... across the whole table without restarting after
  # a new peak, so this matches the loop only for a single drawdown run.
  returnsWithValues[indices_drawdown, "drawdown.duration"] <- seq_along(indices_drawdown)
  returnsWithValues
}
Here is the for loop wrapped in a function.
for_loop <- function(returnsWithValues) {
  # add drawdown.duration col
  for (row in 1:nrow(returnsWithValues)) {
    if (returnsWithValues[row, "value"] == returnsWithValues[row, "peak"]) {
      returnsWithValues[row, "drawdown.duration"] = 0
    } else {
      if (row == 1) {
        returnsWithValues[row, "drawdown.duration"] = 1
      } else {
        returnsWithValues[row, "drawdown.duration"] = returnsWithValues[row - 1, "drawdown.duration"] + 1
      }
    }
  }
  returnsWithValues
}
And here is a benchmark against the for loop.
microbenchmark::microbenchmark(
  "for loop" = flp <- for_loop(returnsWithValues),
  indices = ind <- indices(returnsWithValues),
  times = 10
)
Unit: microseconds
     expr      min       lq     mean    median       uq      max neval
 for loop 8671.228 8699.555 8857.198 8826.8185 8967.631 9196.708    10
  indices   92.781   99.349  106.328  102.8385  115.360  122.749    10
all.equal(ind, flp)
[1] TRUE
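Note that numbering the non-peak rows with one running sequence reproduces the loop only when the data contain a single drawdown run. Below is a minimal base R sketch that restarts the count after every peak row. The helper name drawdown_duration and the second test series are made up for illustration and are not part of the original answer; like the loop, it relies on exact equality between value and peak.

```r
# Hypothetical helper (not from the original answer): number of rows since the
# most recent row where value == peak, with the count restarting at each peak.
drawdown_duration <- function(value, peak) {
  at_peak <- value == peak
  idx <- seq_along(value)
  # Row index of the most recent peak seen so far (0 if none yet).
  last_peak <- cummax(ifelse(at_peak, idx, 0))
  idx - last_peak
}

# Sample data from the question.
value <- c(0.94031052, 0.930751624153046, 0.926756311376762,
           0.874209664097166, 0.843026010916249, 2.1)
peak <- c(1, 1, 1, 1, 1, 2.1)
sample_result <- drawdown_duration(value, peak)

# Made-up series with two separate drawdowns (an assumption for illustration).
two_runs <- drawdown_duration(
  value = c(1, 0.9, 0.8, 1.2, 1.1, 1.0, 1.3),
  peak  = c(1, 1, 1, 1.2, 1.2, 1.2, 1.3)
)
```

On the question's sample this gives 1, 2, 3, 4, 5, 0, the same as the loop, and on the made-up series it gives 0, 1, 2, 0, 1, 2, 0. With dplyr it could be attached as mutate(drawdown.duration = drawdown_duration(value, peak)).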
Answer 1 (score: 1)
I think that, as long as each peak value is unique and is not repeated later in another group, you can do:
returnsWithValues %>%
  group_by(peak) %>%
  mutate(drawdown.duration = cumsum(value != peak))
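One edge case may be worth checking before relying on the grouped cumsum: if value comes back to exactly the same peak inside a group, the cumulative sum keeps growing, whereas the question's loop restarts at 0. The sketch below uses made-up data and ave() as a base R stand-in for group_by(peak); both are assumptions for illustration, not something taken from the asker's real data.

```r
# Made-up data: the value returns to the same peak (1) in row 3.
value <- c(1, 0.9, 1, 0.9)
peak  <- c(1, 1, 1, 1)

# Base R analogue of group_by(peak) followed by cumsum(value != peak).
grouped_count <- ave(as.numeric(value != peak), peak, FUN = cumsum)

# Row-by-row rule from the question's loop: 0 at a peak, otherwise previous + 1.
loop_count <- numeric(length(value))
for (i in seq_along(value)) {
  if (value[i] == peak[i]) {
    loop_count[i] <- 0
  } else if (i == 1) {
    loop_count[i] <- 1
  } else {
    loop_count[i] <- loop_count[i - 1] + 1
  }
}
```

Here grouped_count is 0, 1, 1, 2 while loop_count is 0, 1, 0, 1, so the two agree only when the value never touches an already established peak again.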
If you do have repeated peak values, you may need a way to group by runs of consecutive identical peak values, e.g.
returnsWithValues %>%
  # Start counting the number of groups at 1, and every time
  # peak changes compared to the previous row, add 1
  mutate(peak_group = cumsum(c(1, peak[-1] != head(peak, -1)))) %>%
  group_by(peak_group) %>%
  mutate(drawdown.duration = cumsum(value != peak))
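To see what the peak_group expression does on its own, here is a small base R check on a made-up peak vector in which the value 1 appears in two separate runs. The vector is an assumption for illustration, and the rle() variant is just an alternative way to label runs, not part of the original answer.

```r
# Made-up peak column in which the value 1 appears in two separate runs.
peak <- c(1, 1, 2, 2, 1, 1)

# Expression from the answer: start at 1 and add 1 whenever peak changes
# relative to the previous row.
peak_group <- cumsum(c(1, peak[-1] != head(peak, -1)))

# Alternative using run-length encoding: one id per run of equal values.
runs <- rle(peak)
peak_group_rle <- rep(seq_along(runs$lengths), runs$lengths)
```

Both give 1, 1, 2, 2, 3, 3, so the two runs with peak equal to 1 end up in different groups.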