I have a large sparse matrix:
> str(qtr_sim)
Formal class 'dsCMatrix' [package "Matrix"] with 7 slots
..@ i : int [1:32395981] 0 1 2 3 4 5 6 7 8 1 ...
..@ p : int [1:28182] 0 1 2 3 4 5 6 7 8 9 ...
..@ Dim : int [1:2] 28181 28181
..@ Dimnames:List of 2
.. ..$ : chr [1:28181] "1000191" "1000404" "1000457" "1000541" ...
.. ..$ : chr [1:28181] "1000191" "1000404" "1000457" "1000541" ...
..@ x : num [1:32395981] 1 1 1 1 1 ...
..@ uplo : chr "U"
..@ factors : list()
The matrix contains cosine similarity values, i.e. numbers between 0 and 1.
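A side note on the str() output above: a dsCMatrix is symmetric and, with uplo = "U", keeps only the upper triangle in its i, p and x slots. Code that reads those slots directly therefore sees each off-diagonal pair only once. The sketch below illustrates this on a small symmetric matrix (the same values as the 5 x 5 example A) and expands it to general storage. It assumes a reasonably recent version of the Matrix package in which as(., "generalMatrix") is available; the names S and G are only illustrative.

```r
library(Matrix)

# Small symmetric example with the same values as the 5 x 5 matrix A
A <- new("dgCMatrix",
         i = c(0L, 1L, 4L, 0L, 1L, 2L, 4L, 1L, 2L, 3L, 4L, 2L, 3L, 0L, 1L, 2L, 4L),
         p = c(0L, 3L, 7L, 11L, 13L, 17L),
         Dim = c(5L, 5L),
         Dimnames = list(c("A", "B", "C", "D", "E"), c("A", "B", "C", "D", "E")),
         x = c(1, 0.51, 0.03, 0.51, 1, 0.4, 0.06, 0.4, 1, 0.1, 0.05, 0.1,
               1, 0.03, 0.06, 0.05, 1),
         factors = list())

# Store it the way qtr_sim is stored: symmetric, upper triangle only
S <- forceSymmetric(A, uplo = "U")
class(S)
length(S@x)   # fewer stored entries than A; the lower triangle is implied

# Expand to general storage before reading @i, @p and @x directly
# (assumption: as(., "generalMatrix") is supported by the installed Matrix version)
G <- as(S, "generalMatrix")
length(G@x) == length(A@x)
```

If qtr_sim is processed through its slots, doing this expansion first is probably the safer route, at the cost of roughly doubling the stored entries.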
Here is an example of such a matrix, where A, ..., E are what I will call "products":
>A
5 x 5 sparse Matrix of class "dgCMatrix"
A B C D E
A 1.00 0.51 . . 0.03
B 0.51 1.00 0.40 . 0.06
C . 0.40 1.00 0.1 0.05
D . . 0.10 1.0 .
E 0.03 0.06 0.05 . 1.00
> dput(A)
new("dgCMatrix"
, i = c(0L, 1L, 4L, 0L, 1L, 2L, 4L, 1L, 2L, 3L, 4L, 2L, 3L, 0L, 1L,
2L, 4L)
, p = c(0L, 3L, 7L, 11L, 13L, 17L)
, Dim = c(5L, 5L)
, Dimnames = list(c("A", "B", "C", "D", "E"), c("A", "B", "C", "D", "E"))
, x = c(1, 0.51, 0.03, 0.51, 1, 0.4, 0.06, 0.4, 1, 0.1, 0.05, 0.1,
1, 0.03, 0.06, 0.05, 1)
, factors = list()
)
I need a fast way to derive two matrices, B and C, from matrix A:
>B
5 x 5 sparse Matrix of class "dgCMatrix"
A B C D E
[1,] 1.00 1.00 1.00 1.0 1.00
[2,] 0.51 0.51 0.40 0.1 0.06
[3,] 0.03 0.40 0.10 . 0.05
[4,] . 0.06 0.05 . 0.03
[5,] . . . . .
>C
A B C D E
[1,] A B C D E
[2,] B A B C B
[3,] E C D NA C
[4,] NA E E NA A
[5,] NA NA NA NA NA
Missing entries are shown as "NA" because that is what I use in my code (see below).
I use this approach:
B <- C <- matrix(NA, nrow = nrow(A), ncol = ncol(A))
colnames(C) <- colnames(B) <- colnames(A)
for (j in seq_len(ncol(A))) {
  c <- A[, j, drop = FALSE]          # column j as a one-column matrix
  posi <- colnames(c)                # product this column belongs to
  d <- order(c, decreasing = TRUE)   # sort by similarity, highest first
  mat <- c[d, ]                      # named numeric vector
  if (which(names(mat) == posi) != 1) {
    # Sometimes a similarity of 1 does not belong to only one product,
    # and I need first row = column name. The next product with
    # similarity 1 should be in the next row, and so on.
    firstr <- mat[which(names(mat) == posi)]
    mat <- mat[-which(names(mat) == posi)]
    mat <- c(firstr, mat)
  }
  names(mat)[mat == 0] <- NA         # no product name where similarity is 0
  C[, j] <- names(mat)
  B[, j] <- as.numeric(mat)
}
But this approach is really slow. Note that the original sparse matrix is much larger than this example A.
How can I improve my approach?
Answer 0: (score: 0)
OK, maybe this is useful:
library(data.table)
DT <- data.table(val = A@x, i = A@i + 1L,
product = rownames(A)[A@i + 1L],
j = rep(rownames(A), diff(A@p)))
setorderv(DT, c("j", "val"), c(1L, -1L))
DT[, newi := seq_len(.N), by = j]
dcast(DT, newi ~ j, value.var = "val")
# newi A B C D E
#1: 1 1.00 1.00 1.00 1.0 1.00
#2: 2 0.51 0.51 0.40 0.1 0.06
#3: 3 0.03 0.40 0.10 NA 0.05
#4: 4 NA 0.06 0.05 NA 0.03
dcast(DT, newi ~ j, value.var = "product")
# newi A B C D E
#1: 1 A B C D E
#2: 2 B A B C B
#3: 3 E C D NA C
#4: 4 NA E E NA A
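One detail from the question is not covered by sorting on val alone: when another product also has similarity 1 to the column's product, the column's own name is not guaranteed to land in row 1. A possible way to handle this, sketched below, is to add a flag column and sort on it before val. The column name is_self and the 3 x 3 matrix M are hypothetical, made up only to show a tie.

```r
library(Matrix)
library(data.table)

# Hypothetical example: products A and B have similarity 1 to each other
M <- sparseMatrix(i = c(1, 2, 1, 2, 3, 2, 3),
                  j = c(1, 1, 2, 2, 2, 3, 3),
                  x = c(1, 1, 1, 1, 0.2, 0.2, 1),
                  dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

DT <- data.table(val = M@x,
                 product = rownames(M)[M@i + 1L],
                 j = rep(colnames(M), diff(M@p)))

# 1 where the entry is the column's own product (the diagonal), else 0
DT[, is_self := as.integer(product == j)]

# Within each column: own product first, then decreasing similarity
setorderv(DT, c("j", "is_self", "val"), c(1L, -1L, -1L))
DT[, newi := seq_len(.N), by = j]

dcast(DT, newi ~ j, value.var = "product")
```

Sorting on val only would put A ahead of B in column B here, because both have similarity 1 and A comes first in storage order.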
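Of course, the reshaping can produce a large dense object and exhaust memory. If that is a problem, you need to skip the reshaping step (dcast) and instead try to build a sparse matrix from newi, j and val.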
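A minimal sketch of that sparse alternative, assuming Matrix::sparseMatrix is used for the construction. Two choices here are mine rather than part of the answer: j is kept as an integer column index (so columns stay in their original order instead of being sorted alphabetically), and because a sparse matrix cannot hold character data, the product names are stored as row indices in a second matrix, here called C_index, and looked up through rownames(A) when needed.

```r
library(Matrix)
library(data.table)

# Example matrix A from the question
A <- new("dgCMatrix",
         i = c(0L, 1L, 4L, 0L, 1L, 2L, 4L, 1L, 2L, 3L, 4L, 2L, 3L, 0L, 1L, 2L, 4L),
         p = c(0L, 3L, 7L, 11L, 13L, 17L),
         Dim = c(5L, 5L),
         Dimnames = list(c("A", "B", "C", "D", "E"), c("A", "B", "C", "D", "E")),
         x = c(1, 0.51, 0.03, 0.51, 1, 0.4, 0.06, 0.4, 1, 0.1, 0.05, 0.1,
               1, 0.03, 0.06, 0.05, 1),
         factors = list())

# Long format: one row per stored entry, j as integer column index
DT <- data.table(val = A@x,
                 i = A@i + 1L,
                 j = rep(seq_len(ncol(A)), diff(A@p)))
setorderv(DT, c("j", "val"), c(1L, -1L))
DT[, newi := seq_len(.N), by = j]

# Sorted similarities, still sparse
B_sparse <- sparseMatrix(i = DT$newi, j = DT$j, x = DT$val,
                         dims = dim(A),
                         dimnames = list(NULL, colnames(A)))

# Row index of the product behind each entry of B_sparse (0 = no entry)
C_index <- sparseMatrix(i = DT$newi, j = DT$j, x = as.numeric(DT$i),
                        dims = dim(A),
                        dimnames = list(NULL, colnames(A)))

# Look up a product name on demand
rownames(A)[C_index[3, "A"]]
```

Both results have the same sparsity pattern, so the memory cost stays proportional to the number of stored similarities rather than to nrow(A) * ncol(A).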