Question

我正在尝试cbind拥有一个具有数据帧的非常大的矩阵，由于矩阵的大小，我遇到了内存问题。

我有数据：

set.seed(123)
df1 <- data.frame(replicate(5, sample(1:20, 10, rep=TRUE)))
colnames(df1) <- c("col1", "col2", "col3", "col4", "important_col")
df2 <- data.frame(replicate(20, sample(0:0, nrow(df1), rep=TRUE)))
colnames(df2) <- gsub("X", "", colnames(df2))
df_fin <- cbind(df1, df2)

以下是我希望在一个小样本上执行的工作，但是当应用于成千上万的行和1000+的列时，我遇到了内存问题。

vecp <- colnames(df2)

imp_col <- df1$important_col

matrix <-  matrix(vecp, byrow = TRUE,
                           nrow = length(imp_col),
                           ncol = length(vecp),
                           dimnames = list(1:length(imp_col), vecp))

d <- ifelse(matrix == imp_col, 1, 0)


df_fin <- cbind(df1, d)

我试图在第d <- ifelse(matrix == imp_col, 1, 0)行提高代码效率（就是我遇到内存问题的地方）。

在应用sparse语句之前，有没有办法使矩阵成为ifesle矩阵。

我建立了如下矩阵：

   col1 col2 col3 col4 important_col 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1    11   14    3   11             1 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
2     1    1   19   15             4 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
3     3   17   10   10             6 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
4    13   10    8   17            10 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
5    18    5    3   18            19 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
6    11   10    9    5            17 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
7     5   11   18   16            17 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
8     5    8   13    8             6 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
9    10    1    7   16            12 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
10    4   17   17    3             4 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0

最终产品如下：

   col1 col2 col3 col4 important_col 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1     6   20   18   20             3 0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
2    16   10   14   19             9 0 0 0 0 0 0 0 0 1  0  0  0  0  0  0  0  0  0  0  0
3     9   14   13   14             9 0 0 0 0 0 0 0 0 1  0  0  0  0  0  0  0  0  0  0  0
4    18   12   20   16             8 0 0 0 0 0 0 0 1 0  0  0  0  0  0  0  0  0  0  0  0
5    19    3   14    1             4 0 0 0 1 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
6     1   18   15   10             3 0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
7    11    5   11   16             5 0 0 0 0 1 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
8    18    1   12    5            10 0 0 0 0 0 0 0 0 0  1  0  0  0  0  0  0  0  0  0  0
9    12    7    6    7             6 0 0 0 0 0 1 0 0 0  0  0  0  0  0  0  0  0  0  0  0
10   10   20    3    5            18 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  1  0  0

然后我将把它变成一个稀疏矩阵。

Answer 1

问题在于d与矩阵的大小相同，因此，如果矩阵很大，则将有两个。一种可能的选择（尽管可能更慢）是遍历各列并一次更改它们，这只会创建与矩阵的一列大小相同的对象。您可以尝试一下：

for (i in 1:ncol(matrix)) matrix[, i] <- matrix[, i] == imp_col

表达式返回一个布尔值，但是如果您的矩阵由整数组成，则它们将被转换为0和1。

填充矩阵的更有效方法

1 个答案: