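I wrote a script in R to calculate nucleotide diversity for a set of DNA sequence files (FASTA format). I can do this with nested for loops (see the code below), but it is computationally inefficient. I have tried the ifelse and sapply functions, but I can't figure out how to make them work. Can someone help me optimize this code?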
# This code works but it’s very inefficient:
library(pegas)
setwd(dir="d:/my_directory")
file.names<-dir(pattern="\\.fasta$")  # pattern is a regex: escape the dot and anchor at the end
# Create a nucleotide diversity matrix:
exple<-matrix(nrow=10,ncol=10)
rownames(exple)<-paste("sample",c(1:10),sep="_")
colnames(exple)<-paste("species",c(1:10),sep="_")
# Create a function to read DNA sequences and calculate nucleotide diversity:
pi=function(x,y){
  my_seq<-read.dna(paste(x,y,"fasta",sep="."),format="fasta",as.matrix=FALSE)
  nuc.div(my_seq)  # the last expression is the return value
}
# Iterate over rows and columns
for(m in 1:nrow(exple)){
  for(o in 1:ncol(exple)){
    if(paste(colnames(exple)[o],rownames(exple)[m],"fasta",sep=".") %in% dir()){
      divp <- pi(colnames(exple)[o],rownames(exple)[m])
      exple[m,o]<-divp
    }
  }
}
My attempt (at improving efficiency):
library(reshape2)  # provides melt()
exple2<-melt(exple,varnames=c("sample","species"))
exple2$exist<-ifelse(paste(exple2$species,exple2$sample,"fasta",sep=".") %in% dir(),1,0)
exple2$value<-ifelse(exple2$exist==1,
                     sapply(exple2$sample, function(x){
                       pi(exple2$species,exple2$sample)
                     }),"NA")
# I get this error
Error in file(con, "rb") : invalid 'description' argument
# Traceback
10. file(con, "rb")
9. readBin(file, "raw", sz)
8. read.FASTA(file)
7. read.dna(paste(x, y, "fasta", sep = "."), format = "fasta", as.matrix = FALSE)
6. pi(exple2$otu_id, exple2$sample_id)
5. FUN(X[[i]], ...)
4. lapply(X = X, FUN = FUN, ...)
3. sapply(exple2$sample_id, function(x) { pi(exple2$otu_id, exple2$sample_id) })
2. sapply(exple2$sample_id, function(x) { pi(exple2$otu_id, exple2$sample_id) })
1. ifelse(exple2$exist == 1, sapply(exple2$sample_id, function(x) { pi(exple2$otu_id, exple2$sample_id) }), "NA")
Answer 0 (score: 0)
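The error occurs because you are passing vectors with multiple elements to the file argument of read.dna, which expects a single file name (length 1). Since you are essentially running through all combinations of two character vectors, consider using mapply to iterate element-wise over an expand.grid data frame and build the desired result matrix. Also, use tryCatch to handle missing files: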
pi <- function(x, y) {
  tryCatch({
    my_seq <- read.dna(paste(x, y, "fasta", sep="."),
                       format="fasta", as.matrix=FALSE)
    return(nuc.div(my_seq))
  }, error = function(e) NA_real_  # NA (not NULL) keeps the mapply result a numeric vector
  )
}
# DATA FRAME OF ALL POSSIBLE COMBINATIONS (nrow = 100, ncol = 2)
params_df <- expand.grid(sample = paste("sample", c(1:10), sep="_"),
                         species = paste("species", c(1:10), sep="_"),
                         stringsAsFactors = FALSE)  # keep plain strings, not factors
# CAST NUMERIC VECTOR INTO MATRIX OF DEFINED DIMS
exple <- matrix(mapply(pi, params_df$species, params_df$sample),
                nrow = 10, ncol = 10,
                dimnames = list(paste("sample", c(1:10), sep="_"),
                                paste("species", c(1:10), sep="_"))
)
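To see why the original attempt fails and what mapply changes, here is a small sketch that needs no sequence files. The species and sample vectors are made-up illustration values, not the asker's data.

```r
# Hypothetical illustration values (not taken from the question's data)
species <- c("species_1", "species_2", "species_3")
sample  <- c("sample_1", "sample_2", "sample_3")

# paste() is vectorized: two length-3 inputs produce three file names at once.
# Handing a vector like this to read.dna() is what leads to
# "invalid 'description' argument", since a single file name is expected.
all_names <- paste(species, sample, "fasta", sep = ".")
length(all_names)

# mapply() calls the function once per pair of elements instead,
# so each call builds (and would read) exactly one file name.
one_per_call <- mapply(function(x, y) paste(x, y, "fasta", sep = "."),
                       species, sample, USE.NAMES = FALSE)
lengths_per_call <- mapply(function(x, y) length(paste(x, y, "fasta", sep = ".")),
                           species, sample, USE.NAMES = FALSE)
```

Here lengths_per_call records how many file names each individual call sees, which is the quantity that matters for read.dna.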
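As mentioned, because you are reading files one at a time in an I/O process, which cannot be vectorized, this solution may not be any faster, but it does replace the nested for loops with a single hidden loop from the apply family.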
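If most combinations have no file, one possible variation is to check which files exist up front and only read those, instead of relying on tryCatch for every missing file. The following is a minimal sketch under assumptions: fill_matrix, its compute argument, and the folder argument are hypothetical names introduced here for illustration, and the file naming scheme (species.sample.fasta) is taken from the question.

```r
# Sketch: compute a per-file statistic only for files that exist, NA elsewhere.
# `fill_matrix`, `compute` and `folder` are hypothetical names for illustration;
# `compute` stands in for the per-file calculation.
fill_matrix <- function(samples, species, compute, folder = ".") {
  # All sample/species combinations; the first column varies fastest,
  # which matches the column-wise fill order of matrix()
  grid <- expand.grid(sample = samples, species = species,
                      stringsAsFactors = FALSE)
  paths <- file.path(folder,
                     paste(grid$species, grid$sample, "fasta", sep = "."))

  # Start with NA everywhere, then fill in only the files that are present
  values <- rep(NA_real_, nrow(grid))
  present <- file.exists(paths)
  values[present] <- vapply(paths[present], compute, numeric(1))

  matrix(values, nrow = length(samples), ncol = length(species),
         dimnames = list(samples, species))
}

# Possible usage with pegas loaded (assumes nuc.div returns a single number):
# exple <- fill_matrix(paste("sample", 1:10, sep = "_"),
#                      paste("species", 1:10, sep = "_"),
#                      function(f) nuc.div(read.dna(f, format = "fasta")))
```

This still reads the files one by one, so it is not expected to speed up the reads themselves; it only avoids attempting to open files that are not there.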