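I have a data frame in which each row has three reference columns: ref, het and hom. I would like to replace the letters/genotypes in the other columns (G = C, A = T, AG = TC, and vice versa) based on the reference columns.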
structure(list(SNP = c("rs1", "rs2", "rs3", "rs4", "rs5", "rs6",
"rs7", "rs8", "rs9"), ref = c("GG", "AA", "AA", "GG", "GG", "GG",
"AA", "CC", "GG"), het = c("AG", "AG", "AG", "AG", "AG", "AG",
"AG", "AC", "AG"), hom = c("AA", "GG", "GG", "AA", "AA", "AA",
"GG", "AA", "AA"), A = c("TC", "TC", "CC", "AG", "TT", "TC",
"AA", "GG", "GG"), B = c("CC", "TT", "CC", "AG", "TT", "CC",
"AA", "TG", "GG"), C = c("CC", "CC", "CC", "GG", "CC", "TT",
"AA", "TG", "GG"), D = c("TT", "TC", "CC", "AG", "TT", "TT",
"AA", "GG", "AG"), E = c("CC", "TT", "CC", "AG", "TC", "TT",
"AA", "TG", "GG"), F = c("TC", "TT", "TC", "GG", "TC", "TC",
"AA", "GG", "GG"), G = c("TC", "TC", "CC", "AG", "TC", "TC",
"AA", "GG", "GG"), H = c("TC", "TC", "TC", "GG", "TC", "TC",
"AA", "TG", "GG")), .Names = c("SNP", "ref", "het", "hom", "A",
"B", "C", "D", "E", "F", "G", "H"), class = "data.frame", row.names =
c(NA, -9L))
Input:
SNP ref het hom A B C D E F G H I
rs1 GG AG AA TC CC CC TT CC TC TC TC …
rs2 AA AG GG TC TT CC TC TT TT TC TC …
rs3 AA AG GG CC CC CC CC CC TC CC TC …
rs4 GG AG AA AG AG GG AG AG GG AG GG …
rs5 GG AG AA TT TT CC TT TC TC TC TC …
rs6 GG AG AA TC CC TT TT TT TC TC TC …
rs7 AA AG GG AA AA AA AA AA AA AA AA …
rs8 CC AC AA GG TG TG GG TG GG GG TG …
rs9 GG AG AA GG GG GG AG GG GG GG GG …
Desired Output:
SNP ref het hom A B C D E F G H I
rs1 GG AG AA AG GG GG AA GG AG AG AG …
rs2 AA AG GG AG AA GG AG AA AA AG AG …
rs3 AA AG GG GG GG GG GG GG AG GG AG …
rs4 GG AG AA AG AG GG AG AG GG AG GG …
rs5 GG AG AA AA AA GG AA AG AG AG AG …
rs6 GG AG AA AG GG AA AA AA AG AG AG …
rs7 AA AG GG AA AA AA AA AA AA AA AA …
rs8 CC AC AA AA AC AC CC AC CC CC AC …
rs9 GG AG AA GG GG GG AG GG GG GG GG …
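How can I write a function that replaces these letters according to the reference columns? Thanks.
Answer 0 (score: 2)
Instead of going through a list of SNPs, we can create a "dictionary" of all possible genotypes and their counterparts, and check the first sample element of each row (column A). If it is not among ref/het/hom, we assume the elements in that row need to be changed; otherwise we return the row as is.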
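# Lookup table: each genotype maps to its counterpart on the opposite strand
key = list(AA="TT",TT="AA",
           GG="CC",CC="GG",
           AG="TC",TC="AG",
           GA="CT",CT="GA",
           AC="TG",TG="AC",
           CA="GT",GT="CA")
# Translate the sample columns (5 onward) of one row when its first sample
# genotype (column A) is not among that row's ref/het/hom genotypes
changeAlleles <- function(myrow) {
  if (!(myrow[5] %in% myrow[2:4])) {
    myrow <- c(myrow[1:4], sapply(myrow[5:length(myrow)], function(x) key[[x]]))
  }
  return(myrow)
}
# apply() passes each row as a character vector; t() turns the result back into rows
df2 = as.data.frame(t(apply(df, 1, changeAlleles)))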
SNP ref het hom A B C D E F G H
2 rs1 GG AG AA AG GG GG AA GG AG AG AG
3 rs2 AA AG GG AG AA GG AG AA AA AG AG
4 rs3 AA AG GG GG GG GG GG GG AG GG AG
5 rs4 GG AG AA AG AG GG AG AG GG AG GG
6 rs5 GG AG AA AA AA GG AA AG AG AG AG
7 rs6 GG AG AA AG GG AA AA AA AG AG AG
8 rs7 AA AG GG AA AA AA AA AA AA AA AA
9 rs8 CC AC AA CC AC AC CC AC CC CC AC
10 rs9 GG AG AA GG GG GG AG GG GG GG GG
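The same idea can also be written without apply(). The following is only a minimal sketch, not the answerer's code: it assumes the mapping is stored as a named character vector (here called comp, a hypothetical name) rather than a list, and it runs on a made-up three-row excerpt of the question's data (toy). One side effect of indexing a named vector is that a genotype missing from the mapping, such as AT, comes back as NA instead of breaking the row.

```r
# Hypothetical three-row excerpt of the question's data (rs1, rs4, rs8; samples A-C only)
toy <- data.frame(
  SNP = c("rs1", "rs4", "rs8"),
  ref = c("GG", "GG", "CC"),
  het = c("AG", "AG", "AC"),
  hom = c("AA", "AA", "AA"),
  A   = c("TC", "AG", "GG"),
  B   = c("CC", "AG", "TG"),
  C   = c("CC", "GG", "TG"),
  stringsAsFactors = FALSE
)

# Same pairs as `key`, stored as a named character vector so that
# indexing by name is vectorised and unknown genotypes give NA
comp <- c(AA = "TT", TT = "AA", GG = "CC", CC = "GG",
          AG = "TC", TC = "AG", GA = "CT", CT = "GA",
          AC = "TG", TG = "AC", CA = "GT", GT = "CA")

# Rows whose first sample genotype (column A) is not among ref/het/hom
flip <- toy$A != toy$ref & toy$A != toy$het & toy$A != toy$hom

# Translate the sample columns (5 onward) of the flagged rows only
toy2 <- toy
for (j in 5:ncol(toy)) {
  toy2[flip, j] <- unname(comp[toy[flip, j]])
}
toy2
```

Here rs1 and rs8 are translated and rs4 is left alone, because its column A genotype (AG) already matches het.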
Answer 1 (score: 1)
We can use chartr, which translates characters one for one (here T to A and C to G):
df1[5:12] <- lapply(df1[5:12], function(x) chartr('TC', 'AG', x))
df1
# SNP ref het hom A B C D E F G H I
#1 rs1 GG AG AA AG GG GG AA GG AG AG AG …
#2 rs2 AA AG GG AG AA GG AG AA AA AG AG …
#3 rs3 AA AG GG GG GG GG GG GG AG GG AG …
#4 rs4 GG AG AA AG AG GG AG AG GG AG GG …
#5 rs5 GG AG AA AA AA GG AA AG AG AG AG …
#6 rs6 GG AG AA AG GG AA AA AA AG AG AG …
#7 rs7 AA AG GG AA AA AA AA AA AA AA AA …
#8 rs8 CC AC AA GG AG AG GG AG GG GG AG …
#9 rs9 GG AG AA GG GG GG AG GG GG GG GG …
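Note that chartr('TC', 'AG', x) leaves the rs8 row as GG/AG, whereas the desired output has CC/AC there. One possible way to cover that case, sketched below as an assumption rather than as part of this answer, is to use the full base complement chartr("ACGT", "TGCA", x) and apply it only to rows whose column A genotype is not among ref/het/hom. The names toy_df and complement are hypothetical, and the data is a made-up three-row excerpt of the question's table.

```r
# Hypothetical three-row excerpt (rs1, rs4, rs8) with samples A and B only
toy_df <- data.frame(
  SNP = c("rs1", "rs4", "rs8"),
  ref = c("GG", "GG", "CC"),
  het = c("AG", "AG", "AC"),
  hom = c("AA", "AA", "AA"),
  A   = c("TC", "AG", "GG"),
  B   = c("CC", "AG", "TG"),
  stringsAsFactors = FALSE
)

# Full base complement: A<->T and C<->G, applied character by character
complement <- function(x) chartr("ACGT", "TGCA", x)

# Flip only rows whose first sample genotype is not among ref/het/hom;
# without this guard, rows that already match (rs4) would be changed too
flip <- toy_df$A != toy_df$ref & toy_df$A != toy_df$het & toy_df$A != toy_df$hom

for (j in 5:ncol(toy_df)) {
  toy_df[flip, j] <- complement(toy_df[flip, j])
}
toy_df
```

Like the dictionary approach, this relies on column A being representative of the whole row, so it is worth checking on real data before trusting it.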