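I am currently running some functions on large datasets, so each operation takes a long time to execute.
To follow the progress of a computation, it is handy to print the iteration number or the percentage of the calculation completed. With a loop, this is easy to do.
However, is it possible to get something similar for vectorized or predefined functions without actually changing their source code?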
Example data (the function generate_string comes from here: Generating Random Strings):
generate_string <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
x <- generate_string(10000)
y <- generate_string(10000)
Example of a function I would like to monitor (i.e. print the percentage completed):
library(stringdist)
# amatch will find for each element in x the index of the most similar element in y
ind <- amatch(x,y, method = "jw", maxDist = 1)
Answer 0 (score: 1)
pbapply is an option, but it is slower than the direct call:
system.time({ind <- amatch(x,y, method = "jw", maxDist = 1)})
user system elapsed
27.79 0.05 9.72
library(pbapply)
ind <- pbsapply(x, function(xi) amatch(xi,y, method = "jw", maxDist = 1))
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 30s
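Alternatively, the option you mentioned in the comments (splitting the data into chunks) is less elegant, but it is faster and easy to parallelize.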
library(progress)
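system.time({
  nloops <- 20
  # Assign each element of x to one of nloops contiguous chunks
  pp <- floor(nloops * (0:(length(x)-1))/length(x)) + 1
  ind <- c()
  pb <- progress_bar$new(total = nloops)
  for(i in 1:nloops) {
    # Match one chunk, then advance the bar once that chunk is finished
    ind <- c(ind, amatch(x[pp == i],y, method = "jw", maxDist = 1))
    pb$tick()
  }
  pb$terminate()
})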
[===================================================================================] 100%
user system elapsed
25.96 0.06 9.21
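The chunking idea above can be wrapped in a small reusable helper so that the function being monitored is left untouched. The following is only a minimal sketch using base R's txtProgressBar; the name chunked_with_progress and its n_chunks argument are hypothetical, not part of any package. It assumes that f works element-wise on its first argument, so that the result for one element of x does not depend on the other elements of x.

```r
# Hypothetical helper (not from any package): applies a vectorized function
# `f` to `x` chunk by chunk and updates a text progress bar after each chunk.
# Assumption: f(x, ...) returns one result per element of x, and each result
# is independent of the other elements of x.
chunked_with_progress <- function(x, f, ..., n_chunks = 20) {
  # Nothing to split: call f directly on the empty input.
  if (length(x) == 0) return(f(x, ...))
  # Never use more chunks than there are elements.
  n_chunks <- max(1L, min(n_chunks, length(x)))
  # Assign each element of x to one of n_chunks contiguous groups.
  chunk_id <- floor(n_chunks * (seq_along(x) - 1) / length(x)) + 1
  chunks <- split(x, chunk_id)
  pb <- utils::txtProgressBar(min = 0, max = length(chunks), style = 3)
  on.exit(close(pb))
  # Preallocate the result list instead of growing a vector in the loop.
  out <- vector("list", length(chunks))
  for (i in seq_along(chunks)) {
    out[[i]] <- f(chunks[[i]], ...)
    utils::setTxtProgressBar(pb, i)
  }
  unlist(out, use.names = FALSE)
}

# Small demonstration with a base R function.
x_demo <- c("apple", "banana", "cherry", "date", "elder")
res <- chunked_with_progress(x_demo, toupper, n_chunks = 2)

# With stringdist installed, the call from the question might look like:
# ind <- chunked_with_progress(x, stringdist::amatch, table = y,
#                              method = "jw", maxDist = 1)
```

Fewer chunks mean less overhead but coarser progress updates, so n_chunks is a trade-off between speed and feedback.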