Question

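This question extends this post and relates to a machine learning feature selection procedure. I have a large feature matrix, and I'd like to perform a quick and crude feature selection by measuring the correlation between the outer product of each pair of features and the response, since I'll be using a random forest or boosting classifier.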

The number of features is about 60,000 and the number of responses is about 2,200,000.

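Given unlimited memory, the fastest way would probably be to generate a matrix whose columns are the outer products of all pairs of features, and then use cor of this matrix against the response. As a smaller-dimension example: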

set.seed(1)
feature.mat <- matrix(rnorm(2200*100),nrow=2200,ncol=100)
response.vec <- rnorm(2200)

#generate indices of all unique pairs of features and get the outer products:
feature.pairs <- t(combn(1:ncol(feature.mat),2))
feature.pairs.prod <- feature.mat[,feature.pairs[,1]]*feature.mat[,feature.pairs[,2]]

#compute the correlation coefficients
res <- cor(feature.pairs.prod,response.vec)

But for my real dimensions, feature.pairs.prod is 2,200,000 by 1,799,970,000, which obviously cannot be stored in memory.

So my question is whether, and how, it is possible to get all the correlations in reasonable computation time.

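I was thinking that breaking feature.pairs.prod down into chunks that fit in memory, and then running cor between each chunk and response.vec one at a time, would be the fastest, but I'm not sure how to automatically test in R what dimensions these chunks need to be.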

Another option is to apply a function over feature.pairs that computes the outer product and then computes cor between that and response.vec.
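For illustration, the apply option could look roughly like the sketch below. It uses a deliberately tiny matrix (200 rows, 10 features) so it runs instantly; the helper name pair_cor is made up for this example. Only one product column exists at a time, so memory use is small, but cor is called once per pair, which is likely to be slow at the real scale.

```r
set.seed(1)
feature.mat <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10)
response.vec <- rnorm(200)

# one row per unique pair of feature indices
feature.pairs <- t(combn(ncol(feature.mat), 2))

# hypothetical helper: elementwise product of the two columns, then cor with the response
pair_cor <- function(p) {
  cor(feature.mat[, p[1]] * feature.mat[, p[2]], response.vec)
}

# apply over rows of feature.pairs, giving one correlation per pair
res.apply <- apply(feature.pairs, 1, pair_cor)
```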

Any suggestions?

Answer 1

Yes, chunk-wise computation is the way to go. The same approach is used in Out of memory when using outer in solving my big normal equation for least squares estimation.

The first steps need not change:

set.seed(1)
feature.mat <- matrix(rnorm(2200*100),nrow=2200,ncol=100)
response.vec <- rnorm(2200)

#generate indices of all unique pairs of features and get the outer products:
feature.pairs <- t(combn(1:ncol(feature.mat),2))
j1 <- feature.pairs[,1]
j2 <- feature.pairs[,2]

But then we need to break j1 and j2 into chunks:

## number of data
n <- nrow(feature.mat)
## set a chunk size
k <- 1000
## start and end index of each chunk
start <- seq(1, length(j1), by = k)
end <- c(start[-1] - 1, length(j1))

## result for the i-th chunk
chunk_cor <- function (i) {
  ## pair indices belonging to the i-th chunk
  jj <- start[i]:end[i]
  jj1 <- j1[jj]; jj2 <- j2[jj]
  feature.pairs.prod <- feature.mat[,jj1] * feature.mat[,jj2]
  cor(feature.pairs.prod,response.vec)
}

## now we loop through all chunks and combine the result
res <- unlist(lapply(seq_along(start), chunk_cor))
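As a sanity check, the chunked result can be compared against the all-at-once computation on a problem small enough to hold in memory. The sketch below is self-contained and uses assumed toy dimensions (500 rows, 20 features, so 190 pairs) with a chunk size that does not divide the number of pairs evenly, so the shorter last chunk is exercised as well.

```r
set.seed(1)
feature.mat <- matrix(rnorm(500 * 20), nrow = 500, ncol = 20)
response.vec <- rnorm(500)

feature.pairs <- t(combn(ncol(feature.mat), 2))
j1 <- feature.pairs[, 1]
j2 <- feature.pairs[, 2]

## chunk size chosen only for the demonstration: 190 pairs -> chunks of 60, 60, 60, 10
k <- 60
start <- seq(1, length(j1), by = k)
end <- c(start[-1] - 1, length(j1))

chunk_cor <- function (i) {
  jj <- start[i]:end[i]
  cor(feature.mat[, j1[jj]] * feature.mat[, j2[jj]], response.vec)
}

## chunked result versus the single big matrix
res.chunked <- unlist(lapply(seq_along(start), chunk_cor))
res.full <- as.vector(cor(feature.mat[, j1] * feature.mat[, j2], response.vec))

all.equal(res.chunked, res.full)
```

Each column's correlation is computed independently of the others, so splitting the columns into chunks should not change the values, only the peak memory use.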

The main question is how to decide k.

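As shown in the linked answer, we can compute the memory footprint. If you have n rows and k columns (the chunk size), the memory cost of an n * k matrix of doubles is n * k * 8 / 1024 / 1024 / 1024 GB. You can set a memory limit up front; then, since n is known, you can solve for k.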

Now check the memory cost of the function chunk_cor: feature.mat[,jj1], feature.mat[,jj2] and feature.pairs.prod all need to be generated and stored. So the memory size is:

3 * n * k * 8 / 1024 / 1024/ 1024 GB

Now suppose we want to keep the memory footprint under 4GB. Given n, we can solve for k:

k <- floor(4 * 2^30 / (24 * n))
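To make the arithmetic concrete, the formula can be wrapped in a small helper and evaluated for both the toy example and the row count from the question. The function name chunk_size and its arguments are made up for this sketch, and the estimate only counts the three n * k double matrices named above; any extra working memory used inside cor is not included, so a lower limit may be safer in practice.

```r
## hypothetical helper: largest k such that n_copies matrices of n * k doubles
## (8 bytes each) stay within mem_gb gigabytes
chunk_size <- function (n, mem_gb = 4, n_copies = 3) {
  floor(mem_gb * 2^30 / (n_copies * 8 * n))
}

## toy example from above: n = 2200
k.small <- chunk_size(2200)

## row count from the question: n = 2,200,000
k.real <- chunk_size(2200000)

## number of feature pairs for 60,000 features, and the resulting number of chunks
n.pairs <- choose(60000, 2)
n.chunks <- ceiling(n.pairs / k.real)
```

For n = 2200 this gives k = 81344, which exceeds the 4950 pairs of the toy example, so a single chunk suffices there. For n = 2,200,000 it gives k = 81, so the 1,799,970,000 pairs would be processed in roughly 22 million chunks.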
