Question

I have this large binary data.table:

> str(mat)
Classes 'data.table' and 'data.frame':  262561 obs. of  10615 variables:
 $ 1001682: num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1001990: num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1002541: num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1002790: num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1003312: num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1004403: num  0 0 0 0 0 0 0 0 0 0 ...

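It contains ones in some places (it is not all zeros). I am trying to convert it to a data.matrix with mat <- data.matrix(mat), but the R session always aborts. Is this a problem with my computer? Should I try a higher-performance machine, or is there another way to do this? I need a data.matrix.

I am using an early-2015 MacBook Pro with a 2.7 GHz Intel Core i5 and 8 GB of DDR3 RAM.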

Answer 1

Here is how to convert the data.table to a sparse matrix:

library(data.table)
library(Matrix)
DT <- fread("A B C D E
            0 1 0 1 0
            1 0 0 0 0
            1 1 1 0 1")

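# Record the dimensions and column names before reshaping
n_col <- length(DT)
n_row <- nrow(DT)
col_names <- names(DT)

# Stack all columns into one long 'value' column (column-major order)
DT <- melt(DT, measure.vars = col_names)
# Linear positions of the non-zero cells
inds <- DT[, which(as.logical(value))]
# Convert linear positions back to row and column indices
i <- (inds - 1) %% n_row + 1
j <- (inds - 1) %/% n_row + 1

DT <- sparseMatrix(i = i, j = j, x = TRUE, dims = c(n_row, n_col), dimnames = list(NULL, col_names))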
#3 x 5 sparse Matrix of class "lgCMatrix"
#     A B C D E
#[1,] . | . | .
#[2,] | . . . .
#[3,] | | | . |

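It is not clear what you want to do with the data, but a sparse matrix is the most memory-efficient data structure here. Of course, the functions you plan to use must be able to handle such a structure.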
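
To make the memory argument concrete, here is a rough back-of-the-envelope sketch using the dimensions from str(mat) in the question. It assumes the dense result is stored as doubles (8 bytes per cell) and ignores any temporary copies made during the conversion, so the real peak usage is likely higher.

```r
# Dimensions reported by str(mat) in the question
n_rows <- 262561
n_cols <- 10615

cells <- n_rows * n_cols        # total number of cells in the dense matrix
dense_gb <- cells * 8 / 1e9     # a double takes 8 bytes

cells > .Machine$integer.max    # TRUE: the dense matrix needs a long vector
round(dense_gb, 1)              # 22.3 (GB), far more than 8 GB of RAM
```

A sparse matrix only stores the positions of the non-zero cells, so its size grows with the number of ones rather than with the number of cells.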

Edit

The OP wants to compute cosine similarity.

library(qlcMatrix)
cosSparse(DT)
#5 x 5 sparse Matrix of class "dsCMatrix"
#          A         B         C         D         E
#A 1.0000000 0.5000000 0.7071068 .         0.7071068
#B 0.5000000 1.0000000 0.7071068 0.7071068 0.7071068
#C 0.7071068 0.7071068 1.0000000 .         1.0000000
#D .         0.7071068 .         1.0000000 .
#E 0.7071068 0.7071068 1.0000000 .         1.0000000
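
If installing qlcMatrix is not an option, the same column-wise cosine similarity can be sketched with the Matrix package alone. This is only an illustration of the formula (cross product divided by the column norms), not how cosSparse is implemented; the names M, norms, D and sim are made up for this example, and the matrix is rebuilt by hand from the 3 x 5 example with numeric values.

```r
library(Matrix)

# The 3 x 5 example matrix, stored with numeric values (x = 1 is recycled)
M <- sparseMatrix(i = c(2, 3, 1, 3, 3, 1, 3),
                  j = c(1, 1, 2, 2, 3, 4, 5),
                  x = 1, dims = c(3, 5),
                  dimnames = list(NULL, LETTERS[1:5]))

# Euclidean norm of every column
norms <- sqrt(unname(colSums(M^2)))

# Scale the cross product by 1/norm on both sides; the result stays sparse
D <- Diagonal(x = 1 / norms)
sim <- D %*% crossprod(M) %*% D
dimnames(sim) <- list(colnames(M), colnames(M))

sim["A", "B"]   # 0.5
```

Columns that are all zero would have a norm of 0 and produce non-finite values, so they would need to be dropped or handled separately.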

Answer 2

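I am not sure whether this is more efficient than Roland's method, but it produces the same matrix and does not require reshaping the data. It does require essentially the same lapply that was in my previous answer to the OP, which she called slow. This uses the data.table constructed in Roland's answer.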

library(Matrix)

# get positions of non-zero values in data.table
myRows <- lapply(DT, function(x) which(x != 0))

# build sparse matrix
DT <- sparseMatrix(i = unlist(myRows), # row positions of non zero values
                   j = rep(seq_along(myRows), lengths(myRows)), # column positions
                   dims = c(nrow(DT), ncol(DT))) # dimension of matrix

This returns

DT
3 x 5 sparse Matrix of class "lgCMatrix"

[1,] . | . | .
[2,] | . . . .
[3,] | | | . |

R session aborts during data.matrix conversion
