R: replacing letters in columns based on other columns of a data frame

Time: 2017-06-05 08:04:55

Tags: r function dataframe bioinformatics

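I have a data frame in which each row has three reference columns: ref, het and hom. In the remaining columns I want to replace the letters (genotypes) with their complements (G = C, A = T, AG = TC, and vice versa) whenever they do not match the reference columns.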

structure(list(SNP = c("rs1", "rs2", "rs3", "rs4", "rs5", "rs6", 
"rs7", "rs8", "rs9"), ref = c("GG", "AA", "AA", "GG", "GG", "GG", 
"AA", "CC", "GG"), het = c("AG", "AG", "AG", "AG", "AG", "AG", 
"AG", "AC", "AG"), hom = c("AA", "GG", "GG", "AA", "AA", "AA", 
"GG", "AA", "AA"), A = c("TC", "TC", "CC", "AG", "TT", "TC", 
"AA", "GG", "GG"), B = c("CC", "TT", "CC", "AG", "TT", "CC", 
"AA", "TG", "GG"), C = c("CC", "CC", "CC", "GG", "CC", "TT", 
"AA", "TG", "GG"), D = c("TT", "TC", "CC", "AG", "TT", "TT", 
"AA", "GG", "AG"), E = c("CC", "TT", "CC", "AG", "TC", "TT", 
"AA", "TG", "GG"), F = c("TC", "TT", "TC", "GG", "TC", "TC", 
"AA", "GG", "GG"), G = c("TC", "TC", "CC", "AG", "TC", "TC", 
"AA", "GG", "GG"), H = c("TC", "TC", "TC", "GG", "TC", "TC", 
"AA", "TG", "GG")), .Names = c("SNP", "ref", "het", "hom", "A", 
"B", "C", "D", "E", "F", "G", "H"), class = "data.frame", row.names = 
c(NA, 
-9L))

Input:
SNP ref het hom A   B   C   D   E   F   G   H   I
rs1 GG  AG  AA  TC  CC  CC  TT  CC  TC  TC  TC  …
rs2 AA  AG  GG  TC  TT  CC  TC  TT  TT  TC  TC  …
rs3 AA  AG  GG  CC  CC  CC  CC  CC  TC  CC  TC  …
rs4 GG  AG  AA  AG  AG  GG  AG  AG  GG  AG  GG  …
rs5 GG  AG  AA  TT  TT  CC  TT  TC  TC  TC  TC  …
rs6 GG  AG  AA  TC  CC  TT  TT  TT  TC  TC  TC  …
rs7 AA  AG  GG  AA  AA  AA  AA  AA  AA  AA  AA  …
rs8 CC  AC  AA  GG  TG  TG  GG  TG  GG  GG  TG  …
rs9 GG  AG  AA  GG  GG  GG  AG  GG  GG  GG  GG  …

Desired Output:
SNP ref het hom A   B   C   D   E   F   G   H   I
rs1 GG  AG  AA  AG  GG  GG  AA  GG  AG  AG  AG  …
rs2 AA  AG  GG  AG  AA  GG  AG  AA  AA  AG  AG  …
rs3 AA  AG  GG  GG  GG  GG  GG  GG  AG  GG  AG  …
rs4 GG  AG  AA  AG  AG  GG  AG  AG  GG  AG  GG  …
rs5 GG  AG  AA  AA  AA  GG  AA  AG  AG  AG  AG  …
rs6 GG  AG  AA  AG  GG  AA  AA  AA  AG  AG  AG  …
rs7 AA  AG  GG  AA  AA  AA  AA  AA  AA  AA  AA  …
rs8 CC  AC  AA  AA  AC  AC  CC  AC  CC  CC  AC  …
rs9 GG  AG  AA  GG  GG  GG  AG  GG  GG  GG  GG  …

How can I write a function that replaces these letters based on the reference columns? Thanks.

2 answers:

Answer 0: (score: 2)

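We can create a "dictionary" of all possible genotypes and their complements, then go through the SNPs row by row and check the first sample (column A). If its genotype is not among ref / het / hom, we assume the genotypes in that row need to be changed; otherwise we return the row as is.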

key = list(AA="TT",TT="AA",
           GG="CC",CC="GG",
           AG="TC",TC="AG",
           GA="CT",CT="GA",
           AC="TG",TG="AC",
           CA="GT",GT="CA")


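# Flip every sample genotype in a row when the first sample (column 5, "A")
# matches none of the reference genotypes in columns 2:4 (ref, het, hom).
changeAlleles <- function(myrow) {
  if (!(myrow[5] %in% myrow[2:4])) {
    # Look up the complement of each sample genotype in `key`
    myrow <- c(myrow[1:4], sapply(myrow[5:length(myrow)], function(x) key[[x]]))
  }
  return(myrow)
}

# `df` is the data frame from the question. apply() passes each row as a
# character vector; t() turns the result back into one row per SNP.
df2 <- as.data.frame(t(apply(df, 1, changeAlleles)))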

   SNP ref het hom  A  B  C  D  E  F  G  H
2  rs1  GG  AG  AA AG GG GG AA GG AG AG AG
3  rs2  AA  AG  GG AG AA GG AG AA AA AG AG
4  rs3  AA  AG  GG GG GG GG GG GG AG GG AG
5  rs4  GG  AG  AA AG AG GG AG AG GG AG GG
6  rs5  GG  AG  AA AA AA GG AA AG AG AG AG
7  rs6  GG  AG  AA AG GG AA AA AA AG AG AG
8  rs7  AA  AG  GG AA AA AA AA AA AA AA AA
9  rs8  CC  AC  AA CC AC AC CC AC CC CC AC
10 rs9  GG  AG  AA GG GG GG AG GG GG GG GG
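
The same dictionary idea can also be written column-wise, which avoids `apply()` over rows and leaves genotypes that are missing from the table untouched. This is only a sketch under assumptions: `key_vec` and `flip_rows` are hypothetical names, the small data frame is a three-row excerpt of the question's data, and the layout (reference columns 2:4, samples from column 5 on) is taken from the question.

```r
# Three rows and two sample columns taken from the question's data
df <- data.frame(
  SNP = c("rs1", "rs4", "rs8"),
  ref = c("GG", "GG", "CC"),
  het = c("AG", "AG", "AC"),
  hom = c("AA", "AA", "AA"),
  A   = c("TC", "AG", "GG"),
  B   = c("CC", "AG", "TG"),
  stringsAsFactors = FALSE
)

# Named character vector used as the lookup table (same pairs as `key`)
key_vec <- c(AA = "TT", TT = "AA", GG = "CC", CC = "GG",
             AG = "TC", TC = "AG", GA = "CT", CT = "GA",
             AC = "TG", TG = "AC", CA = "GT", GT = "CA")

# Hypothetical helper: flip the sample columns of every row whose first
# sample genotype matches none of the reference columns
flip_rows <- function(d, ref_cols = 2:4, sample_cols = 5:ncol(d)) {
  first <- d[[sample_cols[1]]]
  refs <- as.matrix(d[ref_cols])
  # TRUE for rows where the first sample equals none of ref/het/hom
  needs_flip <- unname(rowSums(refs == first) == 0)
  for (j in sample_cols) {
    flipped <- unname(key_vec[d[[j]]])
    # Genotypes missing from the table are left unchanged
    missing_key <- is.na(flipped)
    flipped[missing_key] <- d[[j]][missing_key]
    d[[j]] <- ifelse(needs_flip, flipped, d[[j]])
  }
  d
}

df2 <- flip_rows(df)
df2
```

Here rs1 and rs8 are flipped because their column A genotype matches none of ref / het / hom, while rs4 is returned as is.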

Answer 1: (score: 1)

We can use chartr

df1[5:12] <- lapply(df1[5:12], function(x) chartr('TC', 'AG', x))
df1
#  SNP ref het hom  A  B  C  D  E  F  G  H I
#1 rs1  GG  AG  AA AG GG GG AA GG AG AG AG …
#2 rs2  AA  AG  GG AG AA GG AG AA AA AG AG …
#3 rs3  AA  AG  GG GG GG GG GG GG AG GG AG …
#4 rs4  GG  AG  AA AG AG GG AG AG GG AG GG …
#5 rs5  GG  AG  AA AA AA GG AA AG AG AG AG …
#6 rs6  GG  AG  AA AG GG AA AA AA AG AG AG …
#7 rs7  AA  AG  GG AA AA AA AA AA AA AA AA …
#8 rs8  CC  AC  AA GG AG AG GG AG GG GG AG …
#9 rs9  GG  AG  AA GG GG GG AG GG GG GG GG …
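
Note that in the output above rs8 comes out as GG / AG, whereas the desired output in the question shows CC / AC for that row, because `chartr('TC', 'AG', x)` only covers the rows where T/C has to become A/G. A possible variant, offered as a sketch rather than as the answer's method, is to complement all four bases with `chartr("ACGT", "TGCA", x)` and to apply it only to rows whose first sample matches none of the reference columns. The names `needs_flip` and `sample_cols` are hypothetical, and the data frame is a three-row excerpt of the question's data.

```r
# Three rows and two sample columns taken from the question's data
df1 <- data.frame(
  SNP = c("rs1", "rs7", "rs8"),
  ref = c("GG", "AA", "CC"),
  het = c("AG", "AG", "AC"),
  hom = c("AA", "GG", "AA"),
  A   = c("TC", "AA", "GG"),
  B   = c("CC", "AA", "TG"),
  stringsAsFactors = FALSE
)

sample_cols <- 5:ncol(df1)

# Rows whose first sample genotype matches none of ref/het/hom
needs_flip <- df1$A != df1$ref & df1$A != df1$het & df1$A != df1$hom

# chartr() translates character by character, so "ACGT" -> "TGCA"
# replaces each base with its complement in a single pass
for (j in sample_cols) {
  df1[[j]][needs_flip] <- chartr("ACGT", "TGCA", df1[[j]][needs_flip])
}

df1
```

With this variant rs8 becomes CC / AC, and rows such as rs7 that already match a reference genotype are left alone.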