Question

有没有人建议我如何将此代码矢量化或以其他方式加快速度？我正在创建一个可能非常大的矩阵。在每一行中，我想随机选择2列，并将它们从0翻转为1。

我不能选择相同的行号和列号，即矩阵的对角线将为零，因此(1:N)[-j]中的sample()。因为这会随着每行的变化而变化，所以我无法通过使用矢量化来看到这种方法，但并行化可能是一种选择吗？

我使用library(Matrix)来表示稀疏矩阵功能。

library(Matrix)
N <- 100
m <- Matrix(0, nrow = N, ncol = N)

for(j in 1:N) {
    cols <- sample((1:N)[-j], 2) #Choose 2 columns not equal to the 
    m[j, cols] <- 1
}

有什么想法吗？

Answer 1

library(Matrix)
N <- 7

desired_output <- Matrix(0, nrow = N, ncol = N)
set.seed(1)
for(j in 1:N) {
  cols <- sample((1:N)[-j], 2) #Choose 2 columns not equal to the 
  desired_output[j, cols] <- 1
}

# 7 x 7 sparse Matrix of class "dgCMatrix"
#                   
# [1,] . . 1 . . . 1
# [2,] . . . . 1 1 .
# [3,] . 1 . . . 1 .
# [4,] . . . . 1 . 1
# [5,] 1 . . 1 . . .
# [6,] 1 1 . . . . .
# [7,] . 1 . . 1 . .

res <- Matrix(0, nrow = N, ncol = N)
set.seed(1)
ind <- cbind(rep(1:N, each = 2), c(sapply(1:N, function(j) sample((1:N)[-j], 2))))
res[ind] <- 1

all.equal(res, desired_output)
# [1] TRUE

快速工作台：

microbenchmark::microbenchmark(
  OP = {
    desired_output <- Matrix(0, nrow = N, ncol = N)
    set.seed(1)
    for(j in 1:N) {
      cols <- sample((1:N)[-j], 2) #Choose 2 columns not equal to the 
      desired_output[j, cols] <- 1
    }
  },
  Aurele = {
    res <- Matrix(0, nrow = N, ncol = N)
    set.seed(1)
    ind <- cbind(rep(1:N, each = 2), c(sapply(1:N, function(j) sample((1:N)[-j], 2))))
    res[ind] <- 1
  }
)

# Unit: milliseconds
#    expr       min        lq      mean    median        uq       max neval cld
#      OP 10.240969 10.509384 11.065336 10.804949 11.044846 14.903377   100   b
#  Aurele  1.185001  1.258037  1.392021  1.363503  1.434818  4.553614   100  a

Answer 2

编辑：我编辑了我的答案，使其更简单，并包括R和RcppArmadillo进行采样的方式。现在它似乎与N呈线性关系（正如我想的那样）。

您的代码中有两个“问题”：

sample((1:N)[-j], 2)是不必要的分配内存，使您的解决方案与N呈二次方式。解决方案是使用拒绝抽样，因为N很大（因此不会经常发生拒绝）。
替换矩阵中不“连续”的索引。

确实，因为您的样本没有替换，所以为您的问题制作矢量化解决方案并不简单。但同样，通过使用拒绝抽样也是可能的。在这里，我更喜欢Rcpp解决方案：

RCPP：

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerMatrix sample2(int N) {

  IntegerMatrix res(2 * N, 2);
  int j, ind1, ind2;

  for (j = 0; j < N; j++) {

    res(2 * j, 0) = res(2 * j + 1, 0) = j + 1;

    // sample first one
    do {
      ind1 = N * unif_rand();
    } while (ind1 == j);
    res(2 * j, 1) = ind1 + 1;

    // sample second one
    do {
      ind2 = N * unif_rand();
    } while (ind2 == j || ind2 == ind1);
    res(2 * j + 1, 1) = ind2 + 1;
  }

  return res;
}

R：

# table(replicate(1e5, sample2(5)))  # Verify that the sampling is OK
library(Matrix)
N <- 1000
m <- Matrix(0, nrow = N, ncol = N)
m[sample2(N)] <- 1

基准：

microbenchmark::microbenchmark(
  OP = {
    desired_output <- Matrix(0, nrow = N, ncol = N)
    for(j in 1:N) {
      cols <- sample((1:N)[-j], 2) #Choose 2 columns not equal to the 
      desired_output[j, cols] <- 1
    }
  },
  Aurele = {
    res <- Matrix(0, nrow = N, ncol = N)
    ind <- cbind(rep(1:N, each = 2), c(sapply(1:N, function(j) sample((1:N)[-j], 2))))
    res[ind] <- 1
  },
  privefl = {
    m <- Matrix(0, nrow = N, ncol = N)
    m[sample2(N)] <- 1
  },
  times = 20
)

N = 1000的结果：

Unit: milliseconds
    expr        min         lq       mean     median         uq        max neval
      OP 599.996226 605.868229 618.479868 615.653853 625.908794 679.292360    20
  Aurele  12.315795  12.633971  14.183891  13.148149  15.118948  19.649716    20
 privefl   1.401824   1.493371   1.649015   1.622826   1.704273   2.520239    20

N = 10,000的结果：

Unit: milliseconds
    expr        min         lq       mean     median        uq         max neval
  Aurele 812.018743 845.434915 903.387191 863.851661 967.08294 1084.738882    20
 privefl   3.657525   4.083799   4.409226   4.239576   4.49501    6.413498    20

Vectorise R代码从每行中随机选择2列

2 个答案: