Question

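I want to optimize my code (shown below), which is very slow, especially in the loop stage. I have sample data, mymat (the actual matrix is 666,669 x 4,345). The code looks up the letters in the REF and ALT columns of mymat, each of which holds a single letter (A, T, G, or C). In the columns ending in .GT, it replaces the indices with those letters: a 0 is replaced by the letter in the REF column, and a 1 by the letter in the ALT column. An NA is replaced by "0 0" (zero, space, zero). Finally, it transposes the .GT columns into rows, as shown in the result, where everything is separated by spaces. Is there a way to optimize this code? Code: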

    all.samples<-c("02688","02689")
    letters <- strsplit(paste(mymat[,"REF"],mymat[,"ALT"],sep=","),",") # concatenate the letters to have an index to work on from the numbers
    values <- t(mymat[,c(which(colnames(mymat)%in%lapply(all.samples,function(x)(paste(x,"GT",sep=".")))))]) # working on each column needing values
    nbval <- ncol(values) # Keeping track of total number of columns and saving the length of values 

    #Preparing the two temp vectors to be used below
    chars <- vector("character",2) 
    ret <- vector("character",nbval)

    #This is where everything becomes slow; Loop over the rows (and transpose the result)
    mydata<-t(sapply(rownames(values),
                     function(x) { 
                       indexes <- strsplit(values[x,],"/") # Get a list with pairs of indexes

                       for(i in 1:nbval) { # Loop over the number of columns :/
                         for (j in 1:2) { # Loop over the pair 
                           chars[j] <- ifelse(indexes[[i]][1] == "NA", 0, letters[[i]][as.integer(indexes[[i]][j])+1]) # Get '0' if "NA", or the letter with the correct index at this position
                         }
                         ret[i] <- paste(chars[1],chars[2], sep=" ") # concatenate the two chars
                       }
                       return(ret) # return this for this row
                     }
    ))

Sample data:

    mymat<-structure(c("G", "A", "C", "A", "G", "A", "C", "T", "G", "A", 
    "1/1", "0/0", "0/0", "NA", "NA", "0,15", "8,0", "8,0", "NA", 
    "NA", "1/1", "0/1", "0/0", "NA", "NA", "0,35", "12,12", "15,0", 
    "NA", "NA"), .Dim = 5:6, .Dimnames = list(c("chrX:133511988:133511988:G:A:snp", 
    "chrX:133528116:133528116:A:C:snp", "chrX:133528186:133528186:C:T:snp", 
    "chrX:133560301:133560301:A:G:snp", "chrX:133561242:133561242:G:A:snp"
    ), c("REF", "ALT", "02688.GT", "02688.AD", "02689.GT", "02689.AD"
    )))

Result:

    >mydata
    02688.GT  A A A A C C 0 0 0 0
    02689.GT  A A A C C C 0 0 0 0
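
For comparison, here is a minimal vectorized sketch of the same transformation, which avoids the per-cell loop by using matrix indexing. It is not part of the original code: translate_gt and demo_mat are hypothetical names, and the sketch assumes every genotype is either missing (NA or the string "NA") or an unphased pair such as 0/1 whose indices are only 0 or 1. It has not been benchmarked on the full-size matrix.

```r
# Hypothetical helper (an assumption, not the original code): translate all .GT columns at once.
translate_gt <- function(mat, samples) {
  gt_cols <- paste(samples, "GT", sep = ".")
  gt <- mat[, gt_cols, drop = FALSE]                  # variants x samples
  alleles <- mat[, c("REF", "ALT"), drop = FALSE]     # column 1 = REF, column 2 = ALT

  # Missing genotypes: real NA or the literal string "NA"
  missing <- is.na(gt) | gt == "NA"
  gt[missing] <- "0/0"                                # placeholder, overwritten below

  # Allele indices (0/1) shifted to column positions (1/2); as.integer drops dims,
  # so both vectors are in column-major order
  first  <- as.integer(sub("/.*", "", gt)) + 1L
  second <- as.integer(sub(".*/", "", gt)) + 1L

  # Row of each cell in column-major order, used for (row, column) matrix indexing
  row_id <- rep(seq_len(nrow(gt)), times = ncol(gt))
  out <- paste(alleles[cbind(row_id, first)], alleles[cbind(row_id, second)])
  out[missing] <- "0 0"

  # Restore the shape, then transpose so that each sample becomes a row
  t(matrix(out, nrow = nrow(gt), dimnames = dimnames(gt)))
}

# Trimmed copy of the sample data (REF, ALT and the two .GT columns only)
demo_mat <- cbind(
  REF = c("G", "A", "C", "A", "G"),
  ALT = c("A", "C", "T", "G", "A"),
  "02688.GT" = c("1/1", "0/0", "0/0", "NA", "NA"),
  "02689.GT" = c("1/1", "0/1", "0/0", "NA", "NA")
)

res <- translate_gt(demo_mat, c("02688", "02689"))
print(res)
```

On the trimmed sample this should give the same pairs as the result above. The idea is that sub(), paste() and two-column matrix indexing each process all cells in one call, so the work no longer depends on an R-level loop over every row and column.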

How to optimize the performance of this R script

0 answers: