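Assigning a value to a column of a sparse matrix with the most straightforward syntax, M[,1] <- 0, is surprisingly slow in R. Following the advice at https://stat.ethz.ch/pipermail/r-help/2010-December/262365.html, I tried to nullify the column by rebuilding the matrix from the (i, j, x) triplets returned by summary(M) (see the nullify.column function), but this is even slower in some cases and faster in others, depending on the number of non-zero values to nullify in the column: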
library(Matrix)
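nullify.column <- function(M, i) {
  # summary() returns the non-zero entries as (i, j, x) triplets
  M.dense <- summary(M)
  # keep every entry that is not in column i
  filter <- M.dense$j != i
  # pass dims explicitly so that empty trailing rows/columns are not dropped
  return(sparseMatrix(i = M.dense$i[filter], j = M.dense$j[filter],
                      x = M.dense$x[filter], dims = dim(M)))
}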
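build.random.sparse.matrix <- function(n, p, q) {
  # q random (row, column) positions; duplicated positions are summed by sparseMatrix()
  i <- sample(x = 1:n, replace = TRUE, size = q)
  j <- sample(x = 1:p, replace = TRUE, size = q)
  s <- rnorm(q)^2
  # fix the dimensions to n x p even if the last row or column is never sampled
  M <- sparseMatrix(i, j, x = s, dims = c(n, p))
  return(M)
}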
t0 <- Sys.time()
n <- 1000000
p <- 50000
sparse.ratio <- 0.001
q <- n*p * sparse.ratio
t1 <- Sys.time()
A <- build.random.sparse.matrix(n,p,q)
B <- build.random.sparse.matrix(n,1,q*2) # column to nullify with more non-zero-values
M <- cbind(B,A)
t2 <- Sys.time()
delta <- round(as.numeric(difftime(t2,t1,units="secs")),2)
print(paste(c("Building sparse matrix took ", delta, "s"), collapse=""))
t1 <- Sys.time()
M.bis <- nullify.column(M,1)
t2 <- Sys.time()
delta <- round(as.numeric(difftime(t2,t1,units="secs")),2)
print(paste(c("nullify.column took ", delta, "s"), collapse=""))
t1 <- Sys.time()
M[,1] <- 0
t2 <- Sys.time()
delta <- round(as.numeric(difftime(t2,t1,units="secs")),2)
print(paste(c( "M[,i] <- 0 took ", delta, "s"), collapse=""))
t3 <- Sys.time()
delta <- round(as.numeric(difftime(t3,t0,units="secs")),2)
print(paste(c( "overall time ", delta, "s"), collapse=""))
This returns:
[1] "Building sparse matrix took 58.86s"
[1] "nullify.column took 31.2s"
[1] "M[,i] <- 0 took 186.55s"
[1] "overall time 278.49s"
Any ideas? This M[,1] <- 0 takes longer than advanced SVD algorithms and is killing my overall performance.
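For anyone who wants to reproduce the comparison without waiting several minutes, here is a scaled-down sketch of the same benchmark using system.time(). The sizes and the names M.small, M.rebuilt and M.assigned are assumptions chosen for illustration, so the timings will not match the figures above; only the relative behaviour is of interest.

```r
library(Matrix)

set.seed(1)
# Hypothetical, much smaller sizes than in the benchmark above
n <- 10000; p <- 500; q <- 50000
M.small <- sparseMatrix(i = sample.int(n, q, replace = TRUE),
                        j = sample.int(p, q, replace = TRUE),
                        x = rnorm(q)^2,
                        dims = c(n, p))

# Approach 1: rebuild the matrix from its triplets, skipping column 1
time.rebuild <- system.time({
  trip <- summary(M.small)
  keep <- trip$j != 1
  M.rebuilt <- sparseMatrix(i = trip$i[keep], j = trip$j[keep],
                            x = trip$x[keep], dims = dim(M.small))
})

# Approach 2: direct assignment, applied to a copy
M.assigned <- M.small
time.assign <- system.time(M.assigned[, 1] <- 0)

print(time.rebuild["elapsed"])
print(time.assign["elapsed"])
```

Both approaches should produce the same matrix, so any difference is purely one of speed.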
UPDATE
Just to check, I tried the following code:
# test 2
M <- cbind(B,A)
N <- t(M)
t1 <- Sys.time()
N[1,] <- 0
t2 <- Sys.time()
delta <- round(as.numeric(difftime(t2,t1,units="secs")),2)
print(paste(c( "t(M)[,i] <- 0 took ", delta, "s"), collapse=""))
which returns:
[1] "t(M)[,i] <- 0 took 2.05s"
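It seems that the sparse matrices are row-oriented rather than column-oriented, and that nullifying a row is faster than nullifying a column.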
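One thing worth checking here is the actual storage class of the matrix. As far as I can tell, sparseMatrix() returns a dgCMatrix by default, which is a compressed column-oriented format: the slot @p holds zero-based column pointers, @i the zero-based row indices and @x the values. Under that assumption the entries of one column sit contiguously in @x and can be zeroed directly. The sketch below illustrates the idea on a tiny matrix; zero.column is a hypothetical helper, not part of the Matrix package, and I have not timed it on the large matrix above.

```r
library(Matrix)

# Hypothetical helper: assumes M is a dgCMatrix (column-compressed storage)
zero.column <- function(M, j) {
  stopifnot(is(M, "dgCMatrix"))
  start <- M@p[j] + 1L   # @p is zero-based: position of the first entry of column j
  end   <- M@p[j + 1L]   # position of the last entry of column j
  if (end >= start) {
    M@x[start:end] <- 0  # overwrite the stored values with explicit zeros
  }
  drop0(M)               # remove explicit zeros from the sparse structure
}

M.demo <- sparseMatrix(i = c(1, 3, 2, 3), j = c(1, 1, 2, 3),
                       x = c(5, 7, 1, 2), dims = c(3, 3))
print(class(M.demo))     # check the storage class before relying on the slots
M.zeroed <- zero.column(M.demo, 1)
print(M.zeroed)
```

Note that drop0() removes every explicitly stored zero, not only those in column j, which is usually what is wanted.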