Question

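I have a very large dataset with dimensions 60K x 4K. Within each row, I am trying to add up every four consecutive values across the columns. Here is a smaller example dataset.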

    set.seed(123)
    mat <- matrix (sample(0:1, 48, replace = TRUE), 4)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,]    0    1    1    1    0    1    1    0    1     1     0     0
[2,]    1    0    0    1    0    1    1    0    1     0     0     0
[3,]    0    1    1    0    0    1    1    1    0     0     0     0
[4,]    1    1    0    1    1    1    1    1    0     0     0     0

Here is what I am trying to do:

mat[1,1] + mat[1,2] + mat[1,3] + mat[1,4] = 0 + 1 + 1 + 1 = 3

i.e. add up every four values and output the sum.

mat[1,5] + mat[1,6] + mat[1,7] + mat[1,8] = 0 + 1 + 1 + 0 = 2

Continue to the end of the matrix (column 12 here).

mat[1,9] + mat[1,10] + mat[1,11] + mat[1,12]

Once the first row is done, apply the same operation to the second row, like:

mat[2,1] + mat[2,2] + mat[2,3] + mat[2,4] 
mat[2,5] + mat[2,6] + mat[2,7] + mat[2,8]
mat[2,9] + mat[2,10] + mat[2,11] + mat[2,12]

The result will be an nrow x (ncol)/4 matrix.

The expected result looks like this:

          col1-col4      col5-8   col9-12
row1        3              2        2
row2        2              2        1
row3        2              3        0
row4        3              4        0

The same goes for row 3 and onward, through the last row of the matrix. How can I loop over this efficiently?

Answer 1

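While Matthew's answer is really cool (+1, by the way), you can get a much faster (~100x) solution if you avoid apply and use the *Sums functions (colSums in this case), plus a few vector manipulation tricks: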

funSums <- function(mat) {
  t.mat <- t(mat)                                    # rows become columns
  dim(t.mat) <- c(4, length(t.mat) / 4)              # wrap columns every four items (this is what we want to sum)
  t(matrix(colSums(t.mat), nrow=ncol(mat) / 4))      # sum our new 4 element columns, and reconstruct desired output format
}
set.seed(123)
mat <- matrix(sample(0:1, 48, replace = TRUE), 4)
funSums(mat)

Produces the desired output:

     [,1] [,2] [,3]
[1,]    3    2    2
[2,]    2    2    1
[3,]    2    3    0
[4,]    3    4    0
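If the group width needs to vary (the "four / n" in the title), the same reshape trick can be parameterised. This is only a sketch: `funSumsN` and its `n` argument are names I made up for illustration, and it assumes the number of columns is an exact multiple of `n`.

```r
# Hypothetical generalisation of funSums; n is the group width (4 above).
funSumsN <- function(mat, n) {
  stopifnot(ncol(mat) %% n == 0)            # groups must tile the columns exactly
  t.mat <- t(mat)                           # each row of mat becomes a contiguous column
  dim(t.mat) <- c(n, length(t.mat) / n)     # one column per group of n values
  t(matrix(colSums(t.mat), nrow = ncol(mat) / n))  # sum groups, restore row layout
}

m <- matrix(1:12, nrow = 2)   # row 1: 1 3 5 7 9 11; row 2: 2 4 6 8 10 12
res <- funSumsN(m, 3)
res
#      [,1] [,2]
# [1,]    9   27
# [2,]   12   30
```

The transpose matters because R stores matrices column by column: after `t()`, the values of one original row sit next to each other, so re-dimensioning cuts them into groups of `n`.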

Now, let's build something closer to real size and compare against the other options:

set.seed(123)
mat <- matrix(sample(0:1, 6e5, replace = TRUE), 4)

funApply <- function(mat) {   # Matthew's Solution
  apply(array(mat, dim=c(4, 4, ncol(mat) / 4)), MARGIN=c(1,3), FUN=sum)
}
funRcpp <- function(mat) {    # David's Solution
  roll_sum(mat, 4, by.column = FALSE)[, seq_len(ncol(mat) - 4 + 1) %% 4 == 1]
}
library(RcppRoll)        # provides roll_sum, used by funRcpp
library(microbenchmark)
microbenchmark(times=10,
  funSums(mat),
  funApply(mat),
  funRcpp(mat)
)

Produces:

Unit: milliseconds
          expr        min         lq     median       uq       max neval
  funSums(mat)   4.035823   4.079707   5.256517   7.5359  42.06529    10
 funApply(mat) 379.124825 399.060015 430.899162 455.7755 471.35960    10
  funRcpp(mat)  18.481184  20.364885  38.595383 106.0277 132.93382    10

And to check:

all.equal(funSums(mat), funApply(mat))
# [1] TRUE
all.equal(funSums(mat), funRcpp(mat))
# [1] TRUE

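The key point is that the *Sums functions are fully "vectorized": all of the computation happens in C. apply still has to do a fair amount of work in R that is not strictly vectorized (in the sense of a primitive C function), so it is slower (but more flexible).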

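Specific to this problem, it may be possible to make it 2-3x faster, since about half the time is spent on the transposes, which are only needed so that the dim change does what colSums needs in order to work.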
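One way the transposes might be avoided is to add up strided column slices directly. This is a sketch, not something I have benchmarked, so whether it actually beats funSums is an open question; `funStrided` is a hypothetical name, and it assumes the column count is a multiple of `n`.

```r
# Hypothetical transpose-free variant: add the k-th column of every group.
funStrided <- function(mat, n = 4) {
  stopifnot(ncol(mat) %% n == 0)
  Reduce(`+`, lapply(seq_len(n), function(k) {
    # columns k, k + n, k + 2n, ...: the k-th member of each group
    mat[, seq(k, ncol(mat), by = n), drop = FALSE]
  }))
}

# The 4 x 12 example matrix from the question, typed out row by row.
mat <- matrix(c(0,1,1,1, 0,1,1,0, 1,1,0,0,
                1,0,0,1, 0,1,1,0, 1,0,0,0,
                0,1,1,0, 0,1,1,1, 0,0,0,0,
                1,1,0,1, 1,1,1,1, 0,0,0,0), nrow = 4, byrow = TRUE)
res <- funStrided(mat)
res
#      [,1] [,2] [,3]
# [1,]    3    2    2
# [2,]    2    2    1
# [3,]    2    3    0
# [4,]    3    4    0
```

It does `n` matrix additions on matrices of size nrow x (ncol / n), so it should stay vectorized as long as `n` is small relative to the number of columns.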

Answer 2

Splitting the matrix into a 3D array is one way:

apply(array(mat, dim=c(4, 4, 3)), MARGIN=c(1,3), FUN=sum)

#      [,1] [,2] [,3]
# [1,]    3    2    2
# [2,]    2    2    1
# [3,]    2    3    0
# [4,]    3    4    0
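In `dim=c(4, 4, 3)` the first 4 is the number of rows, the second 4 is the group width, and 3 is the number of groups; they just happen to coincide in the example. A sketch with the dimensions derived from the input (`blockSums3D` and `n` are hypothetical names, and the column count is assumed to be a multiple of `n`):

```r
# Hypothetical wrapper around the 3D-array approach.
blockSums3D <- function(mat, n) {
  stopifnot(ncol(mat) %% n == 0)
  # arr[i, j, k] is mat[i, (k - 1) * n + j]: row i, j-th member of group k
  arr <- array(mat, dim = c(nrow(mat), n, ncol(mat) / n))
  apply(arr, MARGIN = c(1, 3), FUN = sum)   # sum over j for each (row, group)
}

m <- matrix(1:16, nrow = 2)   # 2 rows, 8 columns, so two groups of 4
res <- blockSums3D(m, 4)
res
#      [,1] [,2]
# [1,]   16   48
# [2,]   20   52
```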

Answer 3

Here is another approach using the RcppRoll package:

library(RcppRoll) # Uses C++/Rcpp
n <- 4 # The summing range
roll_sum(mat, n, by.column = FALSE)[, seq_len(ncol(mat) - n + 1) %% n == 1]

##      [,1] [,2] [,3]
## [1,]    3    2    2
## [2,]    2    2    1
## [3,]    2    3    0
## [4,]    3    4    0
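The column filter is what turns the rolling (overlapping) sums into non-overlapping ones: with 12 columns and a window of 4 there are 9 windows, and only those starting at columns 1, 5 and 9 are kept. A small base-R sketch of just that index logic (`n_cols`, `starts` and `keep` are illustrative names, not part of RcppRoll):

```r
n <- 4                               # window width
n_cols <- 12                         # number of columns in the example matrix
starts <- seq_len(n_cols - n + 1)    # start column of each rolling window: 1..9
keep <- which(starts %% n == 1)      # windows that begin a fresh group
keep
# [1] 1 5 9

# The same positions written as a stride.
by_stride <- seq(1, n_cols - n + 1, by = n)
```

Note that the `%% n == 1` test selects nothing when `n` is 1, so the stride form may be the safer one to generalise.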

Answer 4

This is probably the slowest:

set.seed(123)
mat <- matrix (sample(0:1, 48, replace = TRUE), 4)
mat

# i is the last column of each group of four; j is one row of mat
output <- sapply(seq(4, ncol(mat), 4), function(i) {
  apply(mat, 1, function(j) {
    sum(j[c(i-3, i-2, i-1, i)], na.rm = TRUE)
  })
})

output

     [,1] [,2] [,3]
[1,]    3    2    2
[2,]    2    2    1
[3,]    2    3    0
[4,]    3    4    0

Nested for-loops might be a little slower, but this answer is pretty close to nested for-loops anyway.
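A possible middle ground keeps the outer loop over groups but replaces the inner per-row apply with rowSums. This is a sketch I have not timed against the other answers; `blockRowSums` is a hypothetical name, and it assumes the column count is a multiple of `n` and that the matrix has more than one row (with a single row, sapply would return a plain vector).

```r
# Hypothetical variant: one rowSums call per group of n columns.
blockRowSums <- function(mat, n = 4) {
  stopifnot(ncol(mat) %% n == 0)
  starts <- seq(1, ncol(mat), by = n)          # first column of each group
  sapply(starts, function(i) {
    rowSums(mat[, i:(i + n - 1), drop = FALSE])
  })
}

m <- matrix(1:24, nrow = 3)   # 3 rows, 8 columns, so two groups of 4
res <- blockRowSums(m)
res
#      [,1] [,2]
# [1,]   22   70
# [2,]   26   74
# [3,]   30   78
```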

Summing every four / n consecutive numbers in a large matrix in R
