Question

I have a large matrix in R (for example, of class Matrix). It is sparse and contains only 0s and 1s.

What I do is the following (where M is the matrix):

j<-list()
for(i in 1:dim(M)[1]){
    which(M[i,]==1)->j[[i]]
}

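This is usually fast, but on a matrix this large (dimensions about 1.7 million by 5,000) it is very slow. I just can't believe there is no faster way to get, for each row, the indices of the columns that contain a 1.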

Answer 1

Using @zx8754's example:

M <- matrix(c(1,1,0,0,1,0,1,1,1,1,1,1), 4)

we can define an auxiliary matrix that holds the row and column indices of the entries equal to 1:

oneMat <- which(M==1, arr.ind=TRUE)

From this auxiliary matrix we can build a list that contains, for each row, the column numbers of the entries equal to 1:

oneList <- lapply(1:nrow(M), function(x) oneMat[oneMat[,1] == x, 2])
#[[1]] 
#[1] 1 2 3
#
#[[2]]
#[1] 1 3
#
#[[3]]
#[1] 2 3
#
#[[4]]
#[1] 2 3

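If the matrix M is large and sparse, oneMat should be much smaller than M. In that case, I would expect the lapply() loop in the second step to be faster than the for loop described in the OP.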

After some testing, I regret to admit that this answer is particularly slow. @ColonelBeauvel's solution is the winner:

j <- list()
set.seed(123)
M <- matrix(rbinom(1e5,1,0.01),ncol=100)
library(microbenchmark)
f_which_and_lappy <- function(x) {oneMat <- which(x==1, arr.ind=TRUE); 
           lapply(1:nrow(x), function(i) oneMat[oneMat[,1] == i, 2])}
f_only_apply <- function(x) {apply(x, 1, function(i) which(i == 1))}
f_with_data.frame <- function(x) {with(data.frame(which(!!x, arr.ind=T)), split(col, row))}
f_OP <- function(x) {for(i in 1:dim(x)[1]){which(x[i,]==1)->j[[i]]}}
res <- microbenchmark(
  f_which_and_lappy(M),
  f_only_apply(M),
  f_with_data.frame(M), 
  f_OP(M),times=1000L)
#> res
#Unit: microseconds
#                 expr       min        lq       mean     median        uq       max neval  cld
# f_which_and_lappy(M) 11063.170 11254.032 12090.9506 11351.1830 11570.662  31313.48  1000    d
#      f_only_apply(M)  3204.572  3359.410  4117.4971  3456.3960  3610.945  25352.35  1000  b  
# f_with_data.frame(M)   739.556   811.906   912.4726   918.0315   946.700  18623.77  1000 a   
#              f_OP(M)  5642.639  5854.751  6955.9980  5969.3685  6151.209 148847.22  1000   c
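
The benchmark above uses a base R matrix, while the question mentions the Matrix class. As a rough, untimed sketch only: assuming the Matrix package is available and the data are in compressed column form with no explicitly stored zeros, the column indices per row can be read from the slots of the transposed matrix. The names M_sparse, tM, row_id and cols_by_row are made up for this illustration.

```r
library(Matrix)

set.seed(123)
# Assumption: a 0/1 matrix stored as a column-compressed sparse Matrix
M_sparse <- Matrix(rbinom(1e5, 1, 0.01), ncol = 100, sparse = TRUE)

# Columns of the transpose correspond to rows of M_sparse
tM <- as(t(M_sparse), "CsparseMatrix")

# diff(tM@p) is the number of stored entries in each column of tM,
# so row_id labels every stored entry with its row in M_sparse
row_id <- rep.int(seq_len(ncol(tM)), diff(tM@p))

# tM@i holds 0-based row indices of tM, i.e. 0-based column indices of M_sparse;
# the factor levels keep rows that contain no 1 as empty entries
cols_by_row <- split(tM@i + 1L, factor(row_id, levels = seq_len(ncol(tM))))
```

This never builds the dense matrix, which may matter at the size described in the question, but it has not been benchmarked here.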

Answer 2

I would rather go for a vectorized approach and use split instead of the apply/lapply family of functions:

M  = matrix(c(1,1,0,0,1,0,1,1,1,1,1,1), 4)

with(data.frame(which(!!M, arr.ind=T)), split(col, row))
#$`1`
#[1] 1 2 3

#$`2`
#[1] 1 3

#$`3`
#[1] 2 3

#$`4`
#[1] 2 3
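
One thing to keep in mind with split: a row that contains no 1 never shows up in the output of which(), so it is missing from the resulting list. A minimal sketch of one way around this, assuming you want an empty entry for such rows, is to split on a factor whose levels cover all row numbers. The matrix below and the names idx, rows and res are made up for illustration.

```r
# Example matrix whose second row contains no 1 (filled column by column)
M <- matrix(c(1, 0, 0,
              0, 0, 1,
              1, 0, 1), nrow = 3)

# Row and column positions of all entries equal to 1
idx <- which(M == 1, arr.ind = TRUE)

# A factor with one level per row keeps rows without any 1
rows <- factor(idx[, "row"], levels = seq_len(nrow(M)))

# One list element per row; the empty row gets integer(0)
res <- split(idx[, "col"], rows)
```

With this, res has the same length as nrow(M), so res[[i]] always refers to row i.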

Answer 3

Edit (after the comments):

apply(M, 1, function(i) which(i == 1))

# [[1]]
# [1] 1 2 3
# 
# [[2]]
# [1] 1 3
# 
# [[3]]
# [1] 2 3
# 
# [[4]]
# [1] 2 3
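
A caveat with apply(): when every row happens to have the same number of ones, apply() simplifies the result to a vector or matrix instead of a list. A small sketch with a made-up matrix M_eq shows the difference; lapply() over the row numbers always returns a list.

```r
# Every row of this matrix has exactly one 1 (filled column by column)
M_eq <- matrix(c(1, 0,
                 0, 1), nrow = 2)

# apply() collapses equal-length results, so this is a plain vector
simplified <- apply(M_eq, 1, function(r) which(r == 1))
is.list(simplified)

# lapply() over the row numbers keeps the list structure
as_list <- lapply(seq_len(nrow(M_eq)), function(i) which(M_eq[i, ] == 1))
is.list(as_list)
```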

Try this example:

#data
M <- matrix(c(1,1,0,0,1,0,1,1,1,1,1,1), 4)
#      [,1] [,2] [,3]
# [1,]    1    1    1
# [2,]    1    0    1
# [3,]    0    1    1
# [4,]    0    1    1

# index of rows with all ones
which(rowSums(M == 1) == ncol(M))
# [1] 1

# index of cols with all ones
which(colSums(M == 1) == nrow(M))
# [1] 3

Speeding up indexing in a large matrix
