How to sort a sparse matrix and store the result

Date: 2017-09-08 12:28:38

Tags: r sparse-matrix

I have a large sparse matrix:

> str(qtr_sim)
Formal class 'dsCMatrix' [package "Matrix"] with 7 slots
  ..@ i       : int [1:32395981] 0 1 2 3 4 5 6 7 8 1 ...
  ..@ p       : int [1:28182] 0 1 2 3 4 5 6 7 8 9 ...
  ..@ Dim     : int [1:2] 28181 28181
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:28181] "1000191" "1000404" "1000457" "1000541" ...
  .. ..$ : chr [1:28181] "1000191" "1000404" "1000457" "1000541" ...
  ..@ x       : num [1:32395981] 1 1 1 1 1 ...
  ..@ uplo    : chr "U"
  ..@ factors : list()

The matrix contains cosine similarity values, i.e. numbers between 0 and 1.

Here is a small example of such a matrix, where I will call A, ..., E "products":

>A
5 x 5 sparse Matrix of class "dgCMatrix"
     A    B    C   D    E
A 1.00 0.51 .    .   0.03
B 0.51 1.00 0.40 .   0.06
C .    0.40 1.00 0.1 0.05
D .    .    0.10 1.0 .   
E 0.03 0.06 0.05 .   1.00


> dput(A)
new("dgCMatrix"
    , i = c(0L, 1L, 4L, 0L, 1L, 2L, 4L, 1L, 2L, 3L, 4L, 2L, 3L, 0L, 1L, 
2L, 4L)
    , p = c(0L, 3L, 7L, 11L, 13L, 17L)
    , Dim = c(5L, 5L)
    , Dimnames = list(c("A", "B", "C", "D", "E"), c("A", "B", "C", "D", "E"))
    , x = c(1, 0.51, 0.03, 0.51, 1, 0.4, 0.06, 0.4, 1, 0.1, 0.05, 0.1, 
1, 0.03, 0.06, 0.05, 1)
    , factors = list()
)

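I need a fast way to derive two matrices, B and C, from matrix A. B holds each column's values sorted in decreasing order, and C holds the names of the products those values belong to: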

>B
5 x 5 sparse Matrix of class "dgCMatrix"
            A       B       C       D       E     
  [1,]   1.00    1.00    1.00     1.0    1.00
  [2,]   0.51    0.51    0.40     0.1    0.06      
  [3,]   0.03    0.40    0.10       .    0.05
  [4,]      .    0.06    0.05       .    0.03   
  [5,]      .       .       .       .       .

>C
            A       B       C       D       E     
  [1,]      A       B       C       D       E
  [2,]      B       A       B       C       B      
  [3,]      E       C       D      NA       C
  [4,]     NA       E       E      NA       A   
  [5,]     NA      NA      NA      NA      NA

The empty entries do not have to be NA, but that is what my code uses (see below).

This is the approach I use:

  B <- C <- matrix(NA, nrow = nrow(A), ncol = ncol(A))
  colnames(C) <- colnames(B) <- colnames(A)

  for (j in seq_len(ncol(A))) {
    col_j <- A[, j, drop = FALSE]   # j-th column (was hard-coded to column 2)
    posi <- colnames(col_j)         # product this column belongs to

    d <- order(as.numeric(col_j), decreasing = TRUE)
    mat <- col_j[d, ]               # named numeric vector, largest value first

    self <- which(names(mat) == posi)
    if (self != 1) {
      # Similarity 1 does not always belong to a single product, and the
      # first row must equal the column name. Move this column's own product
      # to the top; other products with similarity 1 follow in the next rows.
      mat <- c(mat[self], mat[-self])
    }

    names(mat)[mat == 0] <- NA      # zero similarity: no product name
    C[, j] <- names(mat)
    B[, j] <- as.numeric(mat)
  }

But this approach is really slow. Note that the original sparse matrix is much larger than this example A.

How can I improve my approach?

1 answer:

Answer 0: (score: 0)

OK, maybe this is useful:

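library(data.table)
# One row per stored (non-zero) entry of A, read from the compressed-column slots:
# @x = values, @i = 0-based row indices, @p = column pointers.
DT <- data.table(val = A@x, i = A@i + 1L, 
                 product = rownames(A)[A@i + 1L],
                 j = rep(rownames(A), diff(A@p)))  # column label of each entry
# Sort by column, then by value in decreasing order.
setorderv(DT, c("j", "val"), c(1L, -1L))
# Rank of each entry within its column; this becomes the new row index.
DT[, newi := seq_len(.N), by = j]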

dcast(DT, newi ~ j, value.var = "val")
#   newi    A    B    C   D    E
#1:    1 1.00 1.00 1.00 1.0 1.00
#2:    2 0.51 0.51 0.40 0.1 0.06
#3:    3 0.03 0.40 0.10  NA 0.05
#4:    4   NA 0.06 0.05  NA 0.03
dcast(DT, newi ~ j, value.var = "product")
#   newi  A B C  D E
#1:    1  A B C  D E
#2:    2  B A B  C B
#3:    3  E C D NA C
#4:    4 NA E E NA A
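
Two caveats apply when running this on the real data. The `qtr_sim` in the question is a `dsCMatrix`, which stores only one triangle (`uplo = "U"`), so reading `@i`, `@p` and `@x` directly would miss part of every column. The question also asks that each product come first in its own column even when another product ties at similarity 1. Below is a minimal sketch of handling both, assuming a Matrix version in which `as(x, "generalMatrix")` is available. The toy matrix `S` and the helper column `is_self` are names made up for illustration and are not part of the code above.

```r
library(Matrix)
library(data.table)

# Toy symmetric matrix with a tie at 1 between products A and B,
# stored as a dsCMatrix (one triangle only), like qtr_sim in the question.
S <- forceSymmetric(Matrix(c(1.0, 1.0, 0.2,
                             1.0, 1.0, 0.0,
                             0.2, 0.0, 1.0),
                           nrow = 3, sparse = TRUE,
                           dimnames = list(c("A", "B", "C"),
                                           c("A", "B", "C"))))

# Expand to general storage so that every column holds all of its entries.
G <- as(S, "generalMatrix")

DT <- data.table(val = G@x,
                 product = rownames(G)[G@i + 1L],
                 j = rep(colnames(G), diff(G@p)))

# Flag the diagonal entry and sort it ahead of any ties:
# column ascending, own product first, then value descending.
DT[, is_self := product == j]
setorderv(DT, c("j", "is_self", "val"), c(1L, -1L, -1L))
DT[, newi := seq_len(.N), by = j]
```

Sorting on `is_self` before `val` relies on the diagonal holding the column maximum, which is the case for a cosine similarity matrix.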

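Of course, the `dcast` reshaping can create large dense objects and exhaust your memory. If that is a problem, you need to skip that step and instead try to build a sparse matrix from `newi`, `j` and `val`.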
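
That last suggestion could look roughly like the sketch below. It rebuilds the example `A` from the question's `dput()` output and uses `sparseMatrix()` from the Matrix package. The names `j_idx`, `B_sparse` and `C_idx` are made up for illustration. Since a sparse matrix can only hold numbers, the product labels are stored as row indices into `rownames(A)` instead of as strings.

```r
library(Matrix)
library(data.table)

# Example matrix from the question.
A <- new("dgCMatrix",
         i = c(0L, 1L, 4L, 0L, 1L, 2L, 4L, 1L, 2L, 3L, 4L, 2L, 3L,
               0L, 1L, 2L, 4L),
         p = c(0L, 3L, 7L, 11L, 13L, 17L),
         Dim = c(5L, 5L),
         Dimnames = list(c("A", "B", "C", "D", "E"),
                         c("A", "B", "C", "D", "E")),
         x = c(1, 0.51, 0.03, 0.51, 1, 0.4, 0.06, 0.4, 1, 0.1, 0.05,
               0.1, 1, 0.03, 0.06, 0.05, 1))

# Long table of stored entries, ranked within each column as above.
DT <- data.table(val = A@x, i = A@i + 1L,
                 j = rep(rownames(A), diff(A@p)))
setorderv(DT, c("j", "val"), c(1L, -1L))
DT[, newi := seq_len(.N), by = j]

# Numeric column index for each entry.
j_idx <- match(DT$j, colnames(A))

# Sorted values, kept sparse.
B_sparse <- sparseMatrix(i = DT$newi, j = j_idx, x = DT$val,
                         dims = dim(A),
                         dimnames = list(NULL, colnames(A)))

# Original row index of each sorted value (0 means "no entry").
C_idx <- sparseMatrix(i = DT$newi, j = j_idx, x = as.numeric(DT$i),
                      dims = dim(A),
                      dimnames = list(NULL, colnames(A)))

# Product names for one column; index 0 is dropped by R's subsetting.
products_B <- rownames(A)[C_idx[, "B"]]
```

Looking up names one column at a time, as in the last line, avoids ever materialising the full dense character matrix.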