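Fastest way to count the occurrences of each unique column of a matrix in R

Date: 2015-02-12 19:06:30

Tags: r performance matrix aggregate

I am new to R (and to stackoverflow) and would really appreciate your help. I want to count the number of occurrences of each unique column in a matrix. I wrote the code below, but it is very slow: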

frequencyofequalcolumnsinmatrix = function(matrixM){

# Returns a matrix columnswithfrequencyofmtxM whose columns are the distinct
# columns of matrixM, with the frequency of each distinct column in the last row.
# Hence if the last row is c(3,5,3,2), then matrixM has 3+5+3+2=13 columns;
# there are 4 distinct columns; the first distinct column appears 3 times,
# the second distinct column appears 5 times, etc.


n = nrow(matrixM)

columnswithfrequencyofmtxM = c()

while (ncol(matrixM)>0){

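  # columns equal to the first column; all(x == 0) also works for integer matrices,
  # where identical(x, rep(0,n)) would compare integer against double and never match
  indexzero = which(apply(matrixM-matrixM[,1], 2, function(x) all(x == 0)));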

  indexnotzero = setdiff(seq(1:ncol(matrixM)),indexzero);

  # vector of length n+1: entries 1 to n hold the given distinct column, entry n+1 holds its frequency
  frequencyofgivencolumn = c(matrixM[,1], length(indexzero));

  columnswithfrequencyofmtxM = cbind(columnswithfrequencyofmtxM,frequencyofgivencolumn, deparse.level=0);

  matrixM=matrixM[,indexnotzero];

  matrixM = as.matrix(matrixM);

  }

return(columnswithfrequencyofmtxM)


} 

If we apply it to the matrix testmtx, we get:

> testmtx = matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow=3, ncol=6)
> frequencyofequalcolumnsinmatrix(testmtx)
     [,1] [,2] [,3]
[1,]    1    0    1
[2,]    2    1    2
[3,]    4    1    1
[4,]    2    3    1

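where the last row contains the number of occurrences of the column above it.

Not satisfied with my code, I browsed stackoverflow and found the following question: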

Fastest way to count occurrences of each unique element

It shows that the fastest way to count the occurrences of each unique element of a vector is to use the data.table package. Here is the code:

library(data.table)

f6 <- function(x){
  data.table(x)[, .N, keyby = x]
}

When we run it, we get:

> vtr = c(1,2,3,1,1,2,4,2,4)
> f6(vtr)
   x N
1: 1 3
2: 2 3
3: 3 1
4: 4 2

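I tried to modify this code so that I could use it in my case. That requires being able to create vtr as a vector in which each element is itself a vector, but I have not been able to do that (most likely because in R, c(c(1,2),c(3,4)) is the same as c(1,2,3,4)).

Should I try to modify the f6 function? If so, how? Or should I take a completely different approach? If so, which one?

Thanks!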

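5 answers:

Answer 0 (score: 2)

A simple approach is to paste each column into a single string, collect the strings in a vector, and then use that function.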

mat <- matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow=3, ncol=6)

vec <- apply(mat, 2, paste, collapse=" ")

f6(vec)
       x N
1: 0 1 1 3
2: 1 2 1 1
3: 1 2 4 2
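The separator passed to paste matters once values can have more than one digit. A small sketch of the problem, using a made-up 2 x 2 matrix m (not from the question) whose two different columns collapse to the same string when no separator is used:

```r
# Two different columns whose digits concatenate to the same string.
# matrix() fills column-wise, so the columns are c(1, 12) and c(11, 2).
m <- matrix(c(1, 12, 11, 2), nrow = 2)

key_nosep <- apply(m, 2, paste, collapse = "")
key_sep   <- apply(m, 2, paste, collapse = " ")

key_nosep  # "112" "112": the two columns are wrongly merged
key_sep    # "1 12" "11 2": the columns stay distinct

table(key_nosep)  # one key with count 2
table(key_sep)    # two keys with count 1 each
```

With single-digit data such as mat above either choice gives the same counts, but a separator that cannot appear inside a value is the safer default.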

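Edit

@RohitDas's answer made me think that, where performance is concerned, it is best to measure first. If I take all the functions shown in the question the OP linked to and add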

f7 <- table

along with @DavidArenburg's f10 suggestion

f10 <- function(x){ 
  table(unlist(data.table(x)[, lapply(.SD, paste, collapse = "")])) 
}

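the results are as follows:

After @MaratTalipov added a solution (f11), it is the clear winner. Applied directly to the matrix, it is faster than all the vector-based solutions on this 1000 x 3 test matrix.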

set.seed(1)
testmx <- matrix(sample(1:10, 3 * 1e3, rep=T), nrow=1000)

microbenchmark(
   f1(apply(testmx, 2, paste, collapse=" ")),
   f2(apply(testmx, 2, paste, collapse=" ")),
   f3(apply(testmx, 2, paste, collapse=" ")),
   f4(apply(testmx, 2, paste, collapse=" ")),
   f5(apply(testmx, 2, paste, collapse=" ")),
   f6(apply(testmx, 2, paste, collapse=" ")),
   f7(apply(testmx, 2, paste, collapse=" ")),
   f8(apply(testmx, 2, paste, collapse=" ")),
   f9(apply(testmx, 2, paste, collapse=" ")),
   f10(testmx),
   f11(testmx),
   f12(testmx)
   )
Unit: microseconds
                                       expr      min        lq      mean   median        uq       max neval
 f1(apply(testmx, 2, paste, collapse = " ")) 3311.770 3511.5620 3901.0020 3612.035 3849.3600  9569.987   100
 f2(apply(testmx, 2, paste, collapse = " ")) 3044.997 3263.6515 3667.9232 3430.914 3847.2430  6721.318   100
 f3(apply(testmx, 2, paste, collapse = " ")) 2032.179 2118.0245 2371.8638 2213.301 2430.4155  6631.624   100
 f4(apply(testmx, 2, paste, collapse = " ")) 2119.949 2218.3050 2497.1513 2286.442 2425.0260  6258.987   100
 f5(apply(testmx, 2, paste, collapse = " ")) 2131.498 2221.5775 2459.9300 2309.925 2530.3115  4222.575   100
 f6(apply(testmx, 2, paste, collapse = " ")) 3121.217 3367.7815 3738.3239 3486.155 3835.1175  7979.352   100
 f7(apply(testmx, 2, paste, collapse = " ")) 1766.175 1832.9650 2040.5483 1889.169 2032.1795  3784.110   100
 f8(apply(testmx, 2, paste, collapse = " ")) 2085.303 2169.2240 2435.6932 2237.168 2404.2380  5002.109   100
 f9(apply(testmx, 2, paste, collapse = " ")) 2802.090 2988.0230 3449.0685 3056.930 3373.1710 17640.957   100
                                f10(testmx) 4027.017 4251.6385 4865.7036 4399.461 4848.7035 11811.581   100
                                f11(testmx)  500.058  549.1395  624.9526  576.279  636.1395  1176.809   100
                                f12(testmx) 1827.769 1886.4740 1957.0555 1902.834 1964.4270  3600.487   100

Answer 1 (score: 1)

A "brute force" approach:

f11 <- function(testmtx) {
  nc <- ncol(testmtx)
  z <- seq(nc)  
  for (i in seq_len(nc-1)) {  # seq_len() yields an empty loop for a one-column matrix
    dup <- sapply(seq(i+1,nc),function(j) identical(testmtx[,i],testmtx[,j]))
    z[which(dup)+i] <- z[i]
  }
  table(z)
}

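It should have complexity O(N^2 * M), where N and M are the number of columns and rows, respectively. The other, paste-based solution has complexity O(N * M^2), so their relative performance should be very sensitive to N / M.

[EDIT] Actually, I am not sure about the complexity of the paste-based solution; it could easily be O(N^2 * M^2)...

[EDIT2] A slightly more efficient alternative to f11(), which uses @BrodieG's approach of comparing a matrix column against a matrix: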
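To make the N^2 term concrete, here is a rough sketch that only counts the identical() calls made by the double loop in f11; n_comparisons is a name made up for this illustration. The two benchmarks in this thread use the same 3000 values, shaped once as 1000 rows x 3 columns and once as 3 rows x 1000 columns:

```r
# Number of identical() calls made by f11 for a matrix with N columns:
# column i is compared with columns i+1, ..., N.
n_comparisons <- function(N) sum(seq_len(N - 1))   # equals N * (N - 1) / 2

n_comparisons(3)     # 3       (tall matrix: 1000 rows, 3 columns)
n_comparisons(1000)  # 499500  (wide matrix: 3 rows, 1000 columns)
```

This would be consistent with f11 doing well on the tall matrix and poorly on the wide one, although the count ignores the cost of each individual comparison.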

f13 <- function(testmtx) {
  nc <- ncol(testmtx)
  z <- seq(nc)  
  for (i in seq_len(nc-1)) {  # seq_len() yields an empty loop for a one-column matrix
    dup <- colSums(abs(testmtx[,seq(i+1,nc),drop=FALSE] - testmtx[,i])) == 0
    z[which(dup)+i] <- z[i]
  }
  table(z)
}

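Answer 2 (score: 1)

This should be reasonably efficient. The first step is to use duplicated to determine which columns to count, and then to use vector recycling and colSums to count the instances of each column.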

f12 <- function(testmx) {
  singles <- !duplicated(testmx, MARGIN=2)
  rbind(
    testmx[, singles, drop=FALSE],   # drop=FALSE keeps a matrix even if only one column is unique
    apply(testmx[, singles, drop=FALSE], 2, function(x) sum(colSums(abs(testmx - x)) == 0))  
  )    
}

Produces:

     [,1] [,2] [,3]
[1,]    1    0    1
[2,]    2    1    2
[3,]    4    1    1
[4,]    2    3    1
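The recycling step inside f12 can be looked at in isolation. A minimal sketch, assuming a small numeric matrix m made up for the illustration: subtracting a vector whose length equals nrow(m) recycles it down every column, so a column identical to it becomes all zeros.

```r
# Columns 1 and 3 are equal; column 2 differs.
m <- matrix(c(1, 2, 4,  0, 1, 1,  1, 2, 4), nrow = 3)
x <- m[, 1]

# length(x) equals nrow(m), so x is recycled down each column in turn
d <- m - x
d

colSums(abs(d)) == 0        # TRUE FALSE TRUE
sum(colSums(abs(d)) == 0)   # 2: the column c(1, 2, 4) occurs twice
```

The abs() is what prevents positive and negative differences from cancelling out in the column sums.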

This appears to be considerably faster than Marat's f11, but f6 + apply appears to be faster still:

set.seed(1)
testmx <- matrix(sample(1:10, 3 * 1e3, rep=T), nrow=3)

library(microbenchmark)
microbenchmark(
  f12(testmx), 
  f11(testmx), 
  f6(apply(testmx, 2, paste, collapse="")), times=10
)

Unit: milliseconds
                                       expr         min          lq       mean
                                f12(testmx)   36.576060   36.931514   38.18358
                                f11(testmx) 2095.305540 2122.316487 2145.72614
 f6(apply(testmx, 2, paste, collapse = ""))    7.570614    7.601697    8.78227

Answer 3 (score: 1)

Here is f6prime:

f6prime = function(mat) {
  dt = as.data.table(t(mat));
  dt[, .N, by = names(dt)]
}

f6prime(mat)
#   V1 V2 V3 N
#1:  1  2  4 2
#2:  0  1  1 3
#3:  1  2  1 1

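Answer 4 (score: 0)

Borrowing from @cdeterman's solution: once you have the vector of pasted column values, you can simply call table on it to get the counts.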

table(vec)
vec
0 1 1 1 2 1 1 2 4 
    3     1     2 
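If the result is needed in the layout asked for in the question (distinct columns with their counts in an extra last row), the pasted keys can be mapped back to the matrix. A minimal base R sketch; count_columns is a hypothetical helper name, not something posted in this thread:

```r
# Hypothetical helper: unique columns with their counts in an extra
# last row, using only base R.
count_columns <- function(mat) {
  key    <- apply(mat, 2, paste, collapse = " ")  # one string per column
  first  <- !duplicated(key)                      # first occurrence of each key
  counts <- table(key)[key[first]]                # counts, in order of first appearance
  rbind(mat[, first, drop = FALSE], as.vector(counts))
}

mat <- matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow = 3, ncol = 6)
res <- count_columns(mat)
res
```

For mat the last row of res is 2 3 1, which matches the output shown in the question.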