Question

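I'm using R and have a large dataset with 12,224,433 rows. For each row, I want to run a Spearman correlation test against a vector and extract the p-value. The script looks like this: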

pvals <- numeric(nrow(SNP))

for(i in 1:nrow(SNP)) {

  fit <- cor.test(vector, as.numeric(SNP[i,c(4:50)]), method='spearman', exact=FALSE)    

  pvals[i] <-  fit$p.value

  names(pvals)[i] <- paste(SNP$V1[i], SNP$V2[i])

}

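The problem is that this takes a very long time. By my calculation, running just the first 70,000 rows took 2 hours, so the full run could take around 200 hours. Is there any way to speed it up?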

Answer 1

Here is what I'd suggest based on the information you shared. I've added my reasoning as comments in the code:

# convert all rows to numeric matrix instead of as.numeric(SNP[i,c(4:50)]) in every loop
# also subsetting matrix directly gives you a vector which is what is needed for cor.test()
y <- as.matrix(SNP[, c(4:50)])

# initialize pvals with NA and then replace each value in every loop run
pvals <- rep(NA_real_, nrow(SNP))

for(i in 1:nrow(SNP)) {

  fit <- cor.test(vector, y[i, ], method = 'spearman', exact = FALSE)    

  pvals[i] <-  fit$p.value

}

# you can assign all names in one go instead of doing it in the loop
names(pvals) <- paste(SNP$V1, SNP$V2)

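Finally, yours is a classic use case for parallel processing. With a parallel processing package such as foreach, you can run multiple tests in parallel and then combine them into the result vector pvals.

I'd also recommend reading the book 'The R Inferno' for more on how to make your code more efficient.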
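As a rough illustration of the parallel idea, here is a minimal sketch. It swaps in the base `parallel` package (shipped with R) rather than foreach, to avoid extra dependencies; the toy data, the helper name `row_pval`, and the worker count of 2 are assumptions made for the example, not part of the original question.

```r
library(parallel)

# Hypothetical toy data standing in for `vector` and the numeric matrix `y`
set.seed(1)
vector <- rnorm(47)
y <- matrix(rnorm(200 * 47), nrow = 200)

# Hypothetical helper: Spearman p-value for a single row against `v`
row_pval <- function(row, v) {
  cor.test(v, row, method = "spearman", exact = FALSE)$p.value
}

# Start a small cluster; 2 workers is an assumption, adjust to your machine
cl <- makeCluster(2)

# parApply splits the rows of `y` across the workers and applies row_pval to each row
pvals_par <- parApply(cl, y, 1, row_pval, v = vector)

stopCluster(cl)
```

The speedup is bounded by the number of cores, so on a few cores this shortens the run but does not change its order of magnitude.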

Answer 2

You can use apply to run the test on each row instead of writing an explicit for loop.
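The code for this answer did not survive, so the following is only a sketch of what an apply-based version might look like. The toy `SNP` data frame and `vector` are hypothetical stand-ins with the same layout as in the question (identifiers in V1 and V2, numeric values in columns 4 to 50).

```r
# Hypothetical toy data with the same layout as in the question
set.seed(42)
vector <- rnorm(47)
SNP <- data.frame(V1 = paste0("chr", 1:5),
                  V2 = 101:105,
                  V3 = "A",
                  matrix(rnorm(5 * 47), nrow = 5))

# Convert the numeric columns to a matrix once, outside the row-wise call
y <- as.matrix(SNP[, 4:50])

# Apply the Spearman test to every row (MARGIN = 1) and keep only the p-value
pvals <- apply(y, 1, function(row) {
  cor.test(vector, row, method = "spearman", exact = FALSE)$p.value
})

# Assign all names in one go
names(pvals) <- paste(SNP$V1, SNP$V2)
```

Note that apply is still a loop internally, so it mainly tidies the code; any large speedup would have to come from elsewhere, such as parallel processing.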

Answer 3

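This would be a good candidate for parallel processing with a package such as foreach or future.apply.

The code below uses future.apply because that package is very simple to use.

The general strategy is to take the operation you want to repeat (i.e. getting a p-value from a subset of the data), turn that operation into a function, and then use future.apply to repeat that function over the different subsets of the data you want to use.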

library(future.apply)

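# Establish method used for parallel processing
# (plan(multiprocess) is deprecated in recent versions of future; multisession is the portable replacement)
  plan(multisession)

# Convert the relevant subset of the data to a numeric matrix
# (as.matrix is needed because SNP is used as a data frame elsewhere, e.g. SNP$V1)
  snp_subset <- as.matrix(SNP[,c(4:50)])
  storage.mode(snp_subset) <- 'double'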

# Define a function to get p.values for a given row of the matrix
  get_pvals <- function(row_index) {
    pvals <- cor.test(vector, snp_subset[row_index,], method = 'spearman', exact = FALSE)$p.value
    names(pvals) <- paste(SNP$V1[row_index], SNP$V2[row_index])
    pvals
  }

# Use parallel processing to get p-values for each row of the matrix
  pvals <- future_sapply(X = seq_len(nrow(SNP)),
                         FUN = get_pvals)
