Optimising sapply() or for(), paste(), to efficiently convert a sparse triplet matrix to libsvm format

Asked: 2017-01-05 05:12:43

Tags: r performance

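I have a piece of R code that I want to optimise for speed so it can handle larger datasets. It currently relies on sapply looping over a vector of numbers (which correspond to the rows of a sparse matrix). The reproducible example below gets at the nub of the problem: the three-line function expensive() is what eats the time, and the reason is obvious (a lot of matching against big vectors on every iteration, plus two nested paste statements). Before I give up and start struggling to do this part of the work in C++, is there something I'm missing? Is there a way to vectorise the sapply call that would make it an order of magnitude or three faster?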

library(microbenchmark)

# create an example object like a simple_triple_matrix
# number of rows and columns in sparse matrix:
n <- 2000 # real number is about 300,000
ncols <- 1000 # real number is about 80,000

# number of non-zero values, about 10 per row:
nonzerovalues <- n * 10

stm <- data.frame(
  i = sample(1:n, nonzerovalues, replace = TRUE),
  j = sample(1:ncols, nonzerovalues, replace = TRUE),
  v = sample(rpois(nonzerovalues, 5), replace = TRUE)
)

# It seems to save about 3% of time to have i, j and v as objects in their own right
i <- stm$i
j <- stm$j
v <- stm$v

expensive <- function(){
  sapply(1:n, function(k){
    # microbenchmarking suggests quicker to have which() rather than a vector of TRUE and FALSE:
    whichi <- which(i == k)
    paste(paste(j[whichi], v[whichi], sep = ":"), collapse = " ")
  })
}

microbenchmark(expensive())

The output of expensive() is a character vector of n elements that looks like this:

 [1] "344:5 309:3 880:7 539:6 338:1 898:5 40:1"                                                                                
 [2] "307:3 945:2 949:1 130:4 779:5 173:4 974:7 566:8 337:5 630:6 567:5 750:5 426:5 672:3 248:6 300:7"                         
 [3] "407:5 649:8 507:5 629:5 37:3 601:5 992:3 377:8" 

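For what it's worth, the motivation is to efficiently write data from a sparse matrix format (from either slam or Matrix, but starting with slam) into libsvm format. That is the format shown above, except that each row begins with a number representing the target variable for a support vector machine; it is omitted in this example because it is not part of the speed problem. I am trying to improve on the answers to this question. I forked one of the repositories referred to from there and adapted its approach to work with sparse matrices using these functions. The tests show that it works fine, but it doesn't scale up.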
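To make the target format concrete, here is a minimal sketch of turning per-row strings like the ones above into libsvm lines and writing them out. The label vector `y`, the example strings in `rows`, and the temporary output file are made up for illustration and are not part of the original code. As far as I know, libsvm also expects the feature indices within a row to be in ascending order, so real output may need sorting by j first.

```r
# Hypothetical per-row feature strings, in the shape expensive() returns
rows <- c("40:1 309:3 344:5", "130:4 307:3", "37:3 377:8 407:5")

# Hypothetical target labels, one per row (not part of the original example)
y <- c(1, -1, 1)

# Prepend each label to its row to form the libsvm lines
libsvm_lines <- paste(y, rows)

# Write to a temporary file, one line per matrix row, and read it back
out_file <- tempfile(fileext = ".libsvm")
writeLines(libsvm_lines, out_file)
readLines(out_file)
```

Because paste() and writeLines() are both vectorised, this step should be cheap compared with building the per-row strings.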

1 Answer:

Answer 0 (score: 2)

Use package data.table. Its grouping (keyby in the code below), combined with fast sorting, saves you from having to find the indices of equal i values.

res1 <- expensive()

library(data.table)
cheaper <- function() {
  setDT(stm)
  res <- stm[, .(i, jv = paste(j, v, sep = ":"))
             ][, .(res = paste(jv, collapse = " ")), keyby = i][["res"]]
  setDF(stm) #clean-up which might not be necessary
  res
}

res2 <- cheaper()

all.equal(res1, res2)
#[1] TRUE

microbenchmark(expensive(), cheaper())
#Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval cld
# expensive() 127.63343 135.33921 152.98288 136.13957 138.87969 222.36417   100   b
#   cheaper()  15.31835  15.66584  16.16267  15.98363  16.33637  18.35359   100  a
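One detail worth sketching: grouping returns only the rows that have at least one non-zero entry, whereas expensive() returns an empty string for a row with none, so the two results can differ in length if some row index never occurs in i. The following is a small sketch, in base R rather than data.table and with made-up toy data, of grouping once and then filling the result out to all n rows.

```r
# Toy triplet data (made up for illustration); row 3 has no entries
n <- 4
i <- c(1, 2, 2, 4, 1)
j <- c(5, 3, 9, 2, 7)
v <- c(1, 4, 2, 6, 3)

# Build every "j:v" token in one vectorised call
jv <- paste(j, v, sep = ":")

# Group the tokens by row index once, instead of scanning i for each row
grouped <- tapply(jv, i, paste, collapse = " ")

# Fill a length-n result; rows without entries stay as empty strings
res <- character(n)
res[as.integer(names(grouped))] <- grouped
res
```

The same fill-in step could be applied to a grouped data.table result by indexing with its key column; that variant is untested here.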