I am trying to convert a matrix containing lists (with elements of variable length) into a sparse matrix. Here is a toy example:
mOrig = matrix(
  c(rep(c('a_b', 'X'), 3),
    rep(c('a_b_e', 'X'), 2),
    rep(c('a_b_f', 'X'), 1),
    rep(c('c_d', 'Y'), 3),
    rep(c('c_d_e', 'Y'), 2),
    rep(c('c_d_f', 'Y'), 1)),
  ncol=2, byrow=TRUE
)
colnames(mOrig) = c('in', 'out')
mOrig
in out
[1,] "a_b" "X"
[2,] "a_b" "X"
[3,] "a_b" "X"
[4,] "a_b_e" "X"
[5,] "a_b_e" "X"
[6,] "a_b_f" "X"
[7,] "c_d" "Y"
[8,] "c_d" "Y"
[9,] "c_d" "Y"
[10,] "c_d_e" "Y"
[11,] "c_d_e" "Y"
[12,] "c_d_f" "Y"
The output matrix should look like this:
a b c d e f X Y
[1,] 1 1 0 0 0 0 1 0
[2,] 1 1 0 0 0 0 1 0
[3,] 1 1 0 0 0 0 1 0
[4,] 1 1 0 0 1 0 1 0
[5,] 1 1 0 0 1 0 1 0
[6,] 1 1 0 0 0 1 1 0
[7,] 0 0 1 1 0 0 0 1
[8,] 0 0 1 1 0 0 0 1
[9,] 0 0 1 1 0 0 0 1
[10,] 0 0 1 1 1 0 0 1
[11,] 0 0 1 1 1 0 0 1
[12,] 0 0 1 1 0 1 0 1
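I am close to a solution, but my current approach, which uses unique(unlist(strsplit())) and a for loop, looks extremely inefficient. Does anyone know of an efficient solution, for example one that uses sparseMatrix (or sparse.model.matrix from the Matrix package)?
Thanks a lot!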
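Answer 0: (score: 0)
One of the fastest ways to write into a sparse matrix seems to be the myMatrix[matrix] <- value form, where the index is a two-column matrix of (row, column) positions. This is used below, together with lapply and strsplit.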
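library(Matrix)
# Empty 12 x 8 sparse matrix with columns a..f, X, Y
mx <- Matrix(0,12,8, dimnames = list(NULL, c(letters[1:6], LETTERS[24:25])))
# Split each 'in' value into its tokens
mOrig_split <- strsplit(mOrig[,'in'], '_')
# Long format: one (row index, column name) pair per token, plus the 'out' value
long_fm <- do.call(rbind, lapply(seq_along(mOrig_split), function(x) {
  cbind(x,c(mOrig_split[[x]], mOrig[x,2]))}))
# long_fm is a character matrix, so convert rows to numeric and names to column indices
mx[cbind(as.numeric(long_fm[,1]), match(long_fm[,2], colnames(mx)))] <- 1
mx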
Doing the matching up front may be slightly faster, since it avoids converting the row indices from numeric to character and back:
mx <- Matrix(0,12,8, dimnames = list(NULL, c(letters[1:6], LETTERS[24:25])))
# Map every token straight to its column index
mOrig_split <- lapply(strsplit(mOrig[,'in'], '_'), match, colnames(mx))
mOrig_out <- match(mOrig[,2], colnames(mx))
# long_fm is now a numeric two-column (row, column) index matrix
long_fm <- do.call(rbind, lapply(seq_along(mOrig_split), function(x) {
  cbind(x,c(mOrig_split[[x]], mOrig_out[x]))}))
mx[long_fm] <- 1
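Since the question asks specifically about sparseMatrix, here is a minimal sketch of how the same result might be built directly from row and column indices, without assigning into an existing matrix. The names tokens, cols, i, j and mx2 are illustrative and do not come from the posts above, and deriving the column names from the data instead of hardcoding them is an assumption about what is wanted.

```r
library(Matrix)

# Toy input from the question
mOrig <- matrix(
  c(rep(c('a_b', 'X'), 3),
    rep(c('a_b_e', 'X'), 2),
    rep(c('a_b_f', 'X'), 1),
    rep(c('c_d', 'Y'), 3),
    rep(c('c_d_e', 'Y'), 2),
    rep(c('c_d_f', 'Y'), 1)),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c('in', 'out'))
)

# Split every 'in' value into its tokens
tokens <- strsplit(mOrig[, 'in'], '_', fixed = TRUE)

# Column names (assumption: 'in' tokens first, then 'out' values, each sorted)
cols <- c(sort(unique(unlist(tokens))), sort(unique(mOrig[, 'out'])))

# Row index: repeat each row number once per token, then once more for 'out'
i <- c(rep(seq_along(tokens), lengths(tokens)), seq_len(nrow(mOrig)))
# Column index: position of each token / 'out' value in cols
j <- match(c(unlist(tokens), mOrig[, 'out']), cols)

# Note: sparseMatrix() sums x over duplicated (i, j) pairs, so if a token
# could repeat within a row, de-duplicate the pairs first.
mx2 <- sparseMatrix(i = i, j = j, x = 1,
                    dims = c(nrow(mOrig), length(cols)),
                    dimnames = list(NULL, cols))
mx2
```

This avoids the per-row lapply and the do.call(rbind, ...) step, which may help on larger inputs, though that has not been benchmarked here.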