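Extracting values from two different columns based on matches in other columns in R

Time: 2015-09-21 00:29:33

Tags: r algorithm matrix

I have a matrix named mymat (approximately 446664 x 234). It has REF and ALT columns, each of which holds one of the letters A, T, G, C (a single letter only). I want to replace the values in the columns ending with .GT with these letters. The rule is: where there is a 0, I want to replace it with the letter from the REF column, and where there is a 1, I want to replace it with the letter from the ALT column. Where there is NA, I want to replace it with "0" "0" (i.e. zero space zero). Finally, I need to transpose all the .GT columns into rows, as shown in the result. In the result, everything is separated by spaces.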

mymat <- structure(c("G", "A", "C", "A", "G", "A", "C", "T", "G", "A", 
"1/1", "0/0", "0/0", "NA", "NA", "0,15", "8,0", "8,0", "NA", 
"NA", "1/1", "0/1", "0/0", "NA", "NA", "0,35", "12,12", "15,0", 
"NA", "NA"), .Dim = 5:6, .Dimnames = list(c("chrX:133511988:133511988:G:A:snp", 
"chrX:133528116:133528116:A:C:snp", "chrX:133528186:133528186:C:T:snp", 
"chrX:133560301:133560301:A:G:snp", "chrX:133561242:133561242:G:A:snp"
), c("REF", "ALT", "02688.GT", "02688.AD", "02689.GT", "02689.AD"
)))

Result

02688.GT  A A A A C C 0 0 0 0
02689.GT  A A A C C C 0 0 0 0 

3 answers:

Answer 0 (score: 2)

You could try:

library(dplyr)
library(stringi)

## convert to data.frame 
data.frame(mymat, check.names = FALSE) %>%
  ## replace the values ("0", "1", "/", "NA") in all columns ending with ".GT" with
  ## the corresponding values in "REF" and "ALT" (" " for "/" and "0 0" for "NA")
  mutate_each(funs(stri_replace_all(., REF, fixed = "0")), ends_with(".GT")) %>%
  mutate_each(funs(stri_replace_all(., ALT, fixed = "1")), ends_with(".GT")) %>%
  mutate_each(funs(stri_replace_all(., " ", fixed = "/")), ends_with(".GT")) %>%
  mutate_each(funs(stri_replace_all(., "0 0", fixed = "NA")), ends_with(".GT")) %>%
  ## keep only the columns ending with ".GT"
  select(ends_with(".GT")) %>%
  ## transpose the results
  t()

Which gives:

         [,1]  [,2]  [,3]  [,4]  [,5] 
02688.GT "A A" "A A" "C C" "0 0" "0 0"
02689.GT "A A" "A C" "C C" "0 0" "0 0"
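
The matrix above still has one cell per variant, whereas the question asks for one space-separated line per sample. A minimal base R sketch of that last step is shown below. It assumes the transposed result has been stored in a variable, called `res` here (a name that does not appear in the answer), and re-creates it by hand so the snippet runs on its own:

```r
# `res` is a hypothetical name; it re-creates the transposed matrix printed above
res <- matrix(
  c("A A", "A A", "C C", "0 0", "0 0",
    "A A", "A C", "C C", "0 0", "0 0"),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("02688.GT", "02689.GT"), NULL)
)

# Join the cells of each row into a single space-separated string;
# the row names are kept as the names of the resulting vector
sample_lines <- apply(res, 1, paste, collapse = " ")

# Prefix each line with its sample name
out <- paste(names(sample_lines), sample_lines)
writeLines(out)
```

`writeLines()` prints one line per sample to the console; passing a file path or connection as its `con` argument would write the same lines to a file instead.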

Answer 1 (score: 0)

I am posting my own answer, but it is actually quite slow, so it needs further optimization.

# Build, for each variant, the pair c(REF, ALT) so that an allele code (0 or 1) can index into it
letters <- strsplit(paste(mymat[,"REF"], mymat[,"ALT"], sep=","), ",")
# all.samples is not defined in this post; it is assumed to be a vector of sample IDs
values <- t(mymat[, which(colnames(mymat) %in% paste(all.samples, "GT", sep="."))]) # one row per sample, one column per variant
nbval <- ncol(values) # number of variants (columns of values)

# Preparing the two temp vectors to be used below
chars <- vector("character", 2)
ret <- vector("character", nbval)

# Loop over the rows (and transpose the result)
mydata <- t(sapply(rownames(values),
                 function(x) {
                   indexes <- strsplit(values[x,], "/") # Get a list with pairs of indexes

                   for (i in 1:nbval) { # Loop over the number of columns :/
                     for (j in 1:2) { # Loop over the pair
                       # Get "0" if the genotype is "NA", otherwise the letter with the correct index at this position
                       chars[j] <- ifelse(indexes[[i]][1] == "NA", "0", letters[[i]][as.integer(indexes[[i]][j])+1])
                     }
                     ret[i] <- paste(chars[1], chars[2], sep=" ") # concatenate the two chars
                   }
                   return(ret) # return this for this row
                 }
))
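
Because the nested loops above handle one cell at a time, a vectorized variant may be faster on a matrix of this size, although that has not been benchmarked here. The following is only a sketch and not the answer author's method: `decode_gt` is a hypothetical helper, it assumes every genotype is either the string "NA" or two alleles coded 0 or 1 and separated by "/", and the data is a reduced copy of the question's matrix (the `.AD` columns and the row names are left out).

```r
# Reduced copy of the question's data (assumption: .AD columns and row names omitted)
mymat <- cbind(
  REF = c("G", "A", "C", "A", "G"),
  ALT = c("A", "C", "T", "G", "A"),
  "02688.GT" = c("1/1", "0/0", "0/0", "NA", "NA"),
  "02689.GT" = c("1/1", "0/1", "0/0", "NA", "NA")
)

# Hypothetical helper: decode a whole genotype column in one pass
decode_gt <- function(gt, ref, alt) {
  a1 <- sub("/.*$", "", gt)  # allele code before the slash
  a2 <- sub("^.*/", "", gt)  # allele code after the slash
  # "0" -> REF letter, "1" -> ALT letter, anything else -> "0"
  pick <- function(a) ifelse(a == "0", ref, ifelse(a == "1", alt, "0"))
  ifelse(gt == "NA", "0 0", paste(pick(a1), pick(a2)))
}

# Apply the helper to every column whose name ends in ".GT"
gt_cols <- grep("\\.GT$", colnames(mymat), value = TRUE)
decoded <- sapply(gt_cols, function(cn) {
  decode_gt(mymat[, cn], mymat[, "REF"], mymat[, "ALT"])
})

# One row per sample, one column per variant
result <- t(decoded)
print(result)
```

The only remaining loop runs over the `.GT` columns; within a column, `sub()`, `ifelse()` and `paste()` operate on all variants at once.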

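Answer 2 (score: 0)

So this is only a partial answer, and I don't know how it will perform with more than 200000 rows. But perhaps someone smarter will figure out how to do this better.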

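temp1 = strsplit(mymat[,3],"/")   # split each genotype of the first .GT column into its alleles
reps = sapply(temp1,length)       # 2 elements per genotype, 1 element for "NA"
refalt = data.frame(REF = rep(mymat[,1],times=reps),ALT = rep(mymat[,2],times=reps),ZERO = "0 0")
GT1 = unlist(temp1)
GT1[GT1=="NA"] = "2"              # send "NA" to the third (ZERO) column
GT1 = as.numeric(GT1)+1           # allele codes 0/1/2 become column indexes 1/2/3
paste(refalt[cbind(seq_along(GT1),GT1)], collapse=" ")   # one lookup per allele, joined by spaces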

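It is incomplete, because we would still need to wrap it in a function that can be passed to apply() or lapply(), and capture the variable name at the start of each row.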
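
One way that wrapping might look is sketched below. This is an illustration under stated assumptions rather than the answer author's code: `gt_to_line` is a hypothetical helper that repeats the snippet's matrix-indexing lookup for a single `.GT` column, and the data is a reduced copy of the question's matrix (the `.AD` columns and the row names are left out). Looping over the column names with `sapply()` keeps the sample name attached to each line.

```r
# Reduced copy of the question's data (assumption: .AD columns and row names omitted)
mymat <- cbind(
  REF = c("G", "A", "C", "A", "G"),
  ALT = c("A", "C", "T", "G", "A"),
  "02688.GT" = c("1/1", "0/0", "0/0", "NA", "NA"),
  "02689.GT" = c("1/1", "0/1", "0/0", "NA", "NA")
)

# Hypothetical helper: turn one genotype column into one space-separated string
gt_to_line <- function(gt, ref, alt) {
  alleles <- strsplit(gt, "/", fixed = TRUE)  # "0/1" -> c("0", "1"); "NA" -> "NA"
  reps <- lengths(alleles)                    # 2 per genotype, 1 for "NA"
  # One lookup row per allele: column 1 = REF, 2 = ALT, 3 = placeholder for NA
  lookup <- cbind(REF = rep(ref, reps), ALT = rep(alt, reps), ZERO = "0 0")
  idx <- unlist(alleles)
  idx[idx == "NA"] <- "2"                     # route "NA" to the ZERO column
  idx <- as.integer(idx) + 1L                 # codes 0/1/2 -> columns 1/2/3
  paste(lookup[cbind(seq_along(idx), idx)], collapse = " ")
}

# Apply it to every ".GT" column; sapply() names the result by column name
gt_cols <- grep("\\.GT$", colnames(mymat), value = TRUE)
lines_by_sample <- sapply(gt_cols, function(cn) {
  gt_to_line(mymat[, cn], mymat[, "REF"], mymat[, "ALT"])
})

writeLines(paste(names(lines_by_sample), lines_by_sample))
```

The lookup table here is a character matrix rather than a data frame, so the two-column index from `cbind()` picks one cell per allele directly.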