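I have a bit of R code that I'd like to optimise for speed so it can handle larger datasets. It currently relies on sapply cycling through a vector of numbers (which correspond to the rows of a sparse matrix). The reproducible example below gets at the nub of the problem: the three-line function expensive() is what chews up the time, and the reason is obvious (a lot of matching against big vectors on every cycle of the loop, plus two nested paste statements). Before I give up and start struggling to do this part of the work in C++, is there something I'm missing? Is there a way to vectorise the sapply call that would make it one to three orders of magnitude faster?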
library(microbenchmark)
# create an example object like a simple_triple_matrix
# number of rows and columns in sparse matrix:
n <- 2000 # real number is about 300,000
ncols <- 1000 # real number is about 80,000
# number of non-zero values, about 10 per row:
nonzerovalues <- n * 10
stm <- data.frame(
  i = sample(1:n, nonzerovalues, replace = TRUE),
  j = sample(1:ncols, nonzerovalues, replace = TRUE),
  v = sample(rpois(nonzerovalues, 5), replace = TRUE)
)
# It seems to save about 3% of time to have i, j and v as objects in their own right
i <- stm$i
j <- stm$j
v <- stm$v
expensive <- function(){
  sapply(1:n, function(k){
    # microbenchmarking suggests quicker to have which() rather than a vector of TRUE and FALSE:
    whichi <- which(i == k)
    paste(paste(j[whichi], v[whichi], sep = ":"), collapse = " ")
  })
}
microbenchmark(expensive())
The output of expensive is a character vector of n elements that looks like this:
[1] "344:5 309:3 880:7 539:6 338:1 898:5 40:1"
[2] "307:3 945:2 949:1 130:4 779:5 173:4 974:7 566:8 337:5 630:6 567:5 750:5 426:5 672:3 248:6 300:7"
[3] "407:5 649:8 507:5 629:5 37:3 601:5 992:3 377:8"
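For what it's worth, the motivation is to efficiently write data from a sparse matrix format (either from slam or Matrix, but starting with slam) into libsvm format. That is the format shown above, except that each row begins with a number representing the target variable for a support vector machine; it is omitted in this example because it is not part of the speed problem. I am trying to improve on the answers to this question. I forked one of the repositories referred to there and adapted its approach to work with sparse matrices using these functions. The tests show that it works fine, but it doesn't scale up.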
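To make the target format concrete, here is a minimal sketch of the final step described above: prefixing each row with its target value and writing the result to disk. The label vector y, the three sample rows in feats, and the temporary output file are all made up for illustration; they are not part of the original code.

```r
# Hypothetical inputs: feats mimics the shape of expensive()'s output,
# y is an assumed vector of target values (one per row)
feats <- c("344:5 309:3", "307:3 945:2 949:1", "407:5")
y <- c(1, -1, 1)

# Prefix each row with its target value, separated by a space
lines <- paste(y, feats)

# Write one row per line to a temporary file and read it back to check
out <- tempfile(fileext = ".libsvm")
writeLines(lines, out)
readLines(out)
```

Since paste() and writeLines() are both vectorised, this step should stay cheap compared with building the feature strings themselves.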
Answer 0 (score: 2)
Use the data.table package.
res1 <- expensive()
library(data.table)
cheaper <- function() {
  setDT(stm)
  res <- stm[, .(i, jv = paste(j, v, sep = ":"))
             ][, .(res = paste(jv, collapse = " ")), keyby = i][["res"]]
  setDF(stm) #clean-up which might not be necessary
  res
}
res2 <- cheaper()
all.equal(res1, res2)
#[1] TRUE
microbenchmark(expensive(),
cheaper())
#Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval cld
# expensive() 127.63343 135.33921 152.98288 136.13957 138.87969 222.36417   100   b
#   cheaper()  15.31835  15.66584  16.16267  15.98363  16.33637  18.35359   100  a
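data.table's grouping (keyby in the code above), combined with its fast sorting, saves you from having to find the indices of equal i values.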
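The same idea (paste once over the whole vector, then group by row instead of searching for matching indices on every iteration) can also be sketched in base R. This is not part of the answer above, only an illustration on a small made-up triplet; whether it comes close to data.table's speed on the real data is untested. One difference worth noting: using a factor with levels 1:n keeps rows that have no entries as empty strings, as expensive() does, whereas grouping with keyby = i returns only the rows that actually occur in i.

```r
# Toy data in the same triplet layout as stm (values are made up)
n <- 5
i <- c(3, 1, 3, 5, 1, 3)
j <- c(10, 20, 30, 40, 50, 60)
v <- c(1, 2, 3, 4, 5, 6)

# 1. Build every "j:v" token in a single vectorised paste() call
jv <- paste(j, v, sep = ":")

# 2. Group the tokens by row; the factor levels 1:n keep empty rows
groups <- split(jv, factor(i, levels = seq_len(n)))

# 3. Collapse each group into one space-separated string
res <- vapply(groups, paste, character(1), collapse = " ", USE.NAMES = FALSE)
res
# [1] "20:2 50:5"      ""               "10:1 30:3 60:6" ""               "40:4"
```

Rows 2 and 4 have no entries in the toy data, so they come back as "".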