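I would like to optimize my code (shown below), which is very slow, particularly in the loop stage. I have a sample matrix mymat
(the real matrix is 666,669 x 4345). The code looks up the letters in the REF and ALT columns of mymat,
each of which holds a single letter: A, T, G, or C. In the columns ending in .GT, it replaces the numbers as follows: a 0 is replaced with the letter from the REF column, and a 1 is replaced with the letter from the ALT column. An NA is replaced with "0 0" (zero, space, zero). Finally, it transposes the .GT columns into rows, as shown in the result, so that everything ends up separated by spaces. Is there a way to optimize this code?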
Code:
all.samples<-c("02688","02689")
letters <- strsplit(paste(mymat[,"REF"],mymat[,"ALT"],sep=","),",") # concatenate the letters to have an index to work on from the numbers
values <- t(mymat[,c(which(colnames(mymat)%in%lapply(all.samples,function(x)(paste(x,"GT",sep=".")))))]) # working on each column needing values
nbval <- ncol(values) # Keeping track of total number of columns and saving the length of values
#Preparing the two temp vectors to be used below
chars <- vector("character",2)
ret <- vector("character",nbval)
#This is where everything becomes slow; Loop over the rows (and transpose the result)
mydata<-t(sapply(rownames(values),
  function(x) {
    indexes <- strsplit(values[x,],"/") # Get a list with pairs of indexes
    for(i in 1:nbval) { # Loop over the number of columns :/
      for (j in 1:2) { # Loop over the pair
        chars[j] <- ifelse(indexes[i] == "NA", 0,letters[[i]][as.integer(indexes[[i]][j])+1]) # Get '0' if "NA" or the letter with the correct index at this position
      }
      ret[i] <- paste(chars[1],chars[2], sep=" ") # concatenate the two chars
    }
    return(ret) # return this for this row
  }
))
Sample data:
mymat<-structure(c("G", "A", "C", "A", "G", "A", "C", "T", "G", "A",
"1/1", "0/0", "0/0", "NA", "NA", "0,15", "8,0", "8,0", "NA",
"NA", "1/1", "0/1", "0/0", "NA", "NA", "0,35", "12,12", "15,0",
"NA", "NA"), .Dim = 5:6, .Dimnames = list(c("chrX:133511988:133511988:G:A:snp",
"chrX:133528116:133528116:A:C:snp", "chrX:133528186:133528186:C:T:snp",
"chrX:133560301:133560301:A:G:snp", "chrX:133561242:133561242:G:A:snp"
), c("REF", "ALT", "02688.GT", "02688.AD", "02689.GT", "02689.AD"
)))
Result:
>mydata
02688.GT A A A A C C 0 0 0 0
02689.GT A A A C C C 0 0 0 0
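
For reference, the replacement rule described above (0 becomes REF, 1 becomes ALT, NA becomes "0 0") can probably be expressed without the inner loops. The following is only a minimal sketch, not a tested solution for the full-size matrix. It assumes every genotype is biallelic and written as two single-digit indexes with one separator character between them (for example 0/1), and translate_gt is a hypothetical helper name that does not appear in the original code.

```r
# Sample data copied from the question
mymat <- structure(c("G", "A", "C", "A", "G", "A", "C", "T", "G", "A",
  "1/1", "0/0", "0/0", "NA", "NA", "0,15", "8,0", "8,0", "NA",
  "NA", "1/1", "0/1", "0/0", "NA", "NA", "0,35", "12,12", "15,0",
  "NA", "NA"), .Dim = 5:6, .Dimnames = list(c("chrX:133511988:133511988:G:A:snp",
  "chrX:133528116:133528116:A:C:snp", "chrX:133528186:133528186:C:T:snp",
  "chrX:133560301:133560301:A:G:snp", "chrX:133561242:133561242:G:A:snp"),
  c("REF", "ALT", "02688.GT", "02688.AD", "02689.GT", "02689.AD")))

all.samples <- c("02688", "02689")
gt.cols <- paste(all.samples, "GT", sep = ".")

# Hypothetical helper: translate one whole .GT column in a single pass.
# Assumption: allele indexes are only "0" or "1", at positions 1 and 3.
translate_gt <- function(gt, ref, alt) {
  missing <- is.na(gt) | gt == "NA"   # real NA or the string "NA"
  a1 <- substr(gt, 1, 1)              # first allele index
  a2 <- substr(gt, 3, 3)              # second allele index
  out <- paste(ifelse(a1 == "0", ref, alt),
               ifelse(a2 == "0", ref, alt))
  out[missing] <- "0 0"               # overwrite the missing genotypes
  out
}

# One call per sample column; transpose so each sample becomes a row
mydata <- t(vapply(gt.cols,
                   function(col) translate_gt(mymat[, col],
                                              mymat[, "REF"],
                                              mymat[, "ALT"]),
                   character(nrow(mymat))))
print(mydata)
```

The idea is that substr, ifelse and paste each work on a whole column at once, so the only remaining R-level loop is over the sample columns rather than over every row and every allele. Whether this is fast enough at 666,669 x 4345 would need to be measured on the real data.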