I've been told that there is no need to use "for" loops in R at all. So I'd like to see how to get rid of this Python-like "for" loop in my R code:
numrows = nrow(yrdf) # number of rows; used by the check inside the loop
diff.vec = c() # vector of differences
for (index in 1:nrow(yrdf)) { # yrdf is a data frame
  if (index == numrows) {
    diff = NA # because there is no entry "below" it
  } else {
    val_index = yrdf$Adj.Close[index]
    val_next = yrdf$Adj.Close[index+1]
    diff = val_index - val_next # diff between two adjacent values
    diff = diff/yrdf$Adj.Close[index+1] * 100.0
  }
  diff.vec <- c(diff.vec, diff) # append to vector of differences
}
Answer 0 (score: 1)
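In my experience, there are three reasons to avoid for loops. The first is that they can be hard for others to read (if you share your code), and the apply family of functions can improve on that (and is more explicit about what is returned). The second is the speed advantage that can be achieved in some cases, particularly if you move to running the code in parallel (e.g., most apply functions are embarrassingly parallel, whereas for loops take more work to break apart).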
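To make the first reason concrete, here is a minimal sketch of the same calculation written with vapply, one member of the apply family. The sample data, the seed, and the names x, n, and pct_diff are assumptions chosen for illustration; they are not part of the original code.

```r
# Illustrative sample data; the column name Adj.Close follows the question.
set.seed(1)
yrdf <- data.frame(Adj.Close = rnorm(5))
x <- yrdf$Adj.Close
n <- length(x)

# Percent difference between each entry and the one below it.
# The last entry has nothing below it, so it becomes NA.
pct_diff <- vapply(seq_len(n), function(i) {
  if (i == n) NA_real_ else (x[i] - x[i + 1]) / x[i + 1] * 100
}, numeric(1))

pct_diff
```

The numeric(1) argument states up front that every call returns exactly one number, which is the sense in which the apply family is more explicit about what it returns.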
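However, it is the third reason that serves you here: a vectorized solution is often better than either of the above, because it avoids repeated calls (e.g., the c at the end of your loop, the if check, etc.). Here, you can do everything with a single vectorized call.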
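First, some sample data:
set.seed(8675309)
yrdf <- data.frame(Adj.Close = rnorm(5))
Then we multiply everything by 100, take the diff of adjacent entries in Adj.Close, and divide by the following entry using vectorized division. Note that I need to pad with NA if (and only if) you want the result to be the same length as the input. If you don't want or need the NA at the end of the vector, it can be even simpler. Also note that diff computes each entry minus the one before it, so the sign is the opposite of the loop's val_index - val_next; negate the result if you need to match the loop exactly.
100 * c(diff(yrdf$Adj.Close),NA) / c(yrdf$Adj.Close[2:nrow(yrdf)], NA)
returns
[1] 238.06442 216.94975 130.41349 -90.47879 NA
And, to be explicit, here is a microbenchmark comparison: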
myForLoop <- function(){
  numrows = nrow(yrdf)
  diff.vec = c() # vector of differences
  for (index in 1:nrow(yrdf)) { # yrdf is a data frame
    if (index == numrows) {
      diff = NA # because there is no entry "below" it
    } else {
      val_index = yrdf$Adj.Close[index]
      val_next = yrdf$Adj.Close[index+1]
      diff = val_index - val_next # diff between two adjacent values
      diff = diff/yrdf$Adj.Close[index+1] * 100.0
    }
    diff.vec <- c(diff.vec, diff) # append to vector of differences
  }
  return(diff.vec)
}
microbenchmark::microbenchmark(
  forLoop = myForLoop()
  , vector = 100 * c(diff(yrdf$Adj.Close),NA) / c(yrdf$Adj.Close[2:nrow(yrdf)], NA)
)
gives:
Unit: microseconds
    expr    min     lq     mean median      uq     max neval
 forLoop 74.238 78.184 82.06786 81.287 84.3740 104.190   100
  vector 20.193 21.718 23.91824 22.716 24.0535  80.754   100
Note that the vector method takes roughly 30% of the for loop's time. This becomes more important as the data size increases:
set.seed(8675309)
yrdf <- data.frame(Adj.Close = rnorm(10000))
microbenchmark::microbenchmark(
forLoop = myForLoop()
, vector = 100 * c(diff(yrdf$Adj.Close),NA) / c(yrdf$Adj.Close[2:nrow(yrdf)], NA)
)
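gives:
Unit: microseconds
    expr        min         lq        mean     median          uq        max neval
 forLoop 306883.977 315116.446 351183.7997 325211.743 361479.6835 545383.457   100
  vector    176.704    194.948    326.6135    219.512    236.9685   4989.051   100
Note the orders-of-magnitude difference here: the vector version takes less than 0.1% of the time. This is probably because each call to c to add a new entry has to re-read the full vector. A slight change can speed up the for loop somewhat, but not all the way to vector speed: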
myForLoopAlt <- function(){
  numrows = nrow(yrdf)
  diff.vec = numeric(numrows) # preallocated vector of differences
  for (index in 1:nrow(yrdf)) { # yrdf is a data frame
    if (index == numrows) {
      diff = NA # because there is no entry "below" it
    } else {
      val_index = yrdf$Adj.Close[index]
      val_next = yrdf$Adj.Close[index+1]
      diff = val_index - val_next # diff between two adjacent values
      diff = diff/yrdf$Adj.Close[index+1] * 100.0
    }
    diff.vec[index] <- diff # assign in place instead of appending
  }
  return(diff.vec)
}
microbenchmark::microbenchmark(
  forLoop = myForLoop()
  , newLoop = myForLoopAlt()
  , vector = 100 * c(diff(yrdf$Adj.Close),NA) / c(yrdf$Adj.Close[2:nrow(yrdf)], NA)
)
gives:
Unit: microseconds
    expr        min         lq        mean      median          uq        max neval
 forLoop 304751.250 315433.802 354605.5850 325944.9075 368584.2065 528732.259   100
 newLoop 168014.142 179579.984 186882.7679 181843.7465 188654.5325 318431.949   100
  vector    169.569    208.193    331.2579    219.9125    233.3115   2956.646   100
This cuts the for loop's time roughly in half, but it is still far slower than the vectorized solution.
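As a sanity check, the sketch below compares a vectorized call against the loop's arithmetic and also shows the variant without the trailing NA. Because diff(x) returns x[i+1] - x[i], the sketch negates it so that the sign matches the loop's val_index - val_next. The names x, n, loop_result, vec_padded, and vec_short are assumptions introduced for illustration.

```r
set.seed(8675309)
yrdf <- data.frame(Adj.Close = rnorm(5))
x <- yrdf$Adj.Close
n <- length(x)

# Reference: the loop's arithmetic, written with a preallocated result.
loop_result <- rep(NA_real_, n)
for (i in seq_len(n - 1)) {
  loop_result[i] <- (x[i] - x[i + 1]) / x[i + 1] * 100
}

# Vectorized, same sign convention as the loop: -diff(x) is x[i] - x[i+1].
vec_padded <- 100 * c(-diff(x), NA) / c(x[-1], NA)

# Without NA padding the result is simply one element shorter.
vec_short <- -100 * diff(x) / x[-1]

stopifnot(isTRUE(all.equal(loop_result, vec_padded)))
stopifnot(isTRUE(all.equal(vec_short, vec_padded[-n])))
```

Dropping the minus signs gives the same magnitudes with the opposite sign, which is what the unnegated diff call produces.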
Answer 1 (score: 0)
yrdf <- data.frame(Adj.Close = rnorm(100))
numrows <- length(yrdf$Adj.Close)
diff.vec <- c((yrdf$Adj.Close[1:(numrows-1)] / yrdf$Adj.Close[2:numrows] - 1) * 100, NA)
Answer 2 (score: 0)
You can also use the lead function from the dplyr package to get the desired result.
library(dplyr)
yrdf <- data.frame(Adj.Close = rnorm(100))
(yrdf$Adj.Close/lead(yrdf$Adj.Close)-1)*100
The calculation has been simplified from (a-b)/b to a/b - 1. This is a vectorized operation rather than a for loop.
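The simplification can be checked numerically. The sketch below uses a base R shift, c(a[-1], NA), as a stand-in for dplyr's lead so that it runs without extra packages; the seed and the names a, b, original, and simplified are assumptions for illustration.

```r
set.seed(42)
yrdf <- data.frame(Adj.Close = rnorm(100))
a <- yrdf$Adj.Close

# Base R stand-in for lead(a): shift left by one and pad the end with NA.
b <- c(a[-1], NA)

original   <- (a - b) / b * 100   # (a - b) / b, as in the question's loop
simplified <- (a / b - 1) * 100   # a / b - 1, as in this answer

# The two forms agree up to floating-point tolerance.
stopifnot(isTRUE(all.equal(original, simplified)))
```

Both forms leave NA in the last position, because the final entry has no following value to compare against.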