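I am trying to replace single-letter amino acid (AA) variant codes with their 3-letter codes (to make them easier to read). It mostly works, but a few of the results are wrong. My commented code is below. Thanks.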
x <- c("p.G12C","p.F121S","p.P124S","p.P124L","p.E13D",
"p.E203K","p.Q209P","p.Q209P","p.Q209L")
aa3 <- c("Ala", "Arg", "Asn", "Asp", "Cys", "Glu", "Gln", "Gly", "His",
"Ile", "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp",
"Tyr", "Val")
aa1 <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K",
"M", "F", "P", "S", "T", "W", "Y", "V")
xy <- x
for (i in seq_along(aa1))
{
# replace one letter at a time, carrying the result forward in xy
xy <- gsub(aa1[i], aa3[i], xy, ignore.case = FALSE)
}
Output
# Note that E, F and Q get unusual 3-letter replacements.
# I could not figure out what is causing this.
xy
[1] "p.Gly12Cys" "p.Prohe121Ser" "p.Pro124Ser" "p.Pro124Leu"
"p.Glylu13Asp" "p.Glylu203Lys" "p.Glyln209Pro" "p.Glyln209Pro" "p.Glyln209Leu"
Expected output
"p.Gly12Cys" "p.Phe121Ser" "p.Pro124Ser" "p.Pro124Leu" "p.Glu13Asp"
"p.Glu203Lys" "p.Gln209Pro" "p.Gln209Pro" "p.Gln209Leu"
Errors
outputs "p.Prohe121Ser" instead of "p.Phe121Ser"
"p.Glylu13Asp" instead of "p.Glu13Asp"
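Answer 0 (score: 4)
We can use mgsub, but note that elements 5 and 6 of its output below are still wrong ("Alasp", "Leuys"): a capital letter introduced by one replacement is replaced again by a later one.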
library(qdap)
mgsub(aa1, aa3, x)
#[1] "p.Gly12Cys" "p.Phe121Ser" "p.Pro124Ser" "p.Pro124Leu"
#[5] "p.Glu13Alasp" "p.Glu203Leuys" "p.Gln209Pro" "p.Gln209Pro"
#[9] "p.Gln209Leu"
Alternatively, split each string into its parts and look the letters up with match:
# colClasses = "character" keeps a column made up only of "T"/"F" from being read as logical
d1 <- read.csv(text=sub('(..)(.)(\\d+)(.)', '\\1,\\2,\\3,\\4', x),
header=FALSE, colClasses="character")
d1[c(2,4)] <- lapply(d1[,c(2,4)], function(x) aa3[match(x, aa1)])
do.call(paste0, d1)
#[1] "p.Gly12Cys" "p.Phe121Ser" "p.Pro124Ser" "p.Pro124Leu" "p.Glu13Asp"
#[6] "p.Glu203Lys" "p.Gln209Pro" "p.Gln209Pro" "p.Gln209Leu"
Or use gsubfn
library(gsubfn)
gsubfn('[A-Z]', setNames(as.list(aa3), aa1), x)
#[1] "p.Gly12Cys" "p.Phe121Ser" "p.Pro124Ser" "p.Pro124Leu" "p.Glu13Asp"
#[6] "p.Glu203Lys" "p.Gln209Pro" "p.Gln209Pro" "p.Gln209Leu"
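To see why a single pass such as the gsubfn call avoids the problem in the question, the following base R sketch contrasts the two strategies on two of the inputs. It is only an illustration: the variable names sequential, hits and single_pass are made up for this example, and regmatches<- is used here as a stand-in for what gsubfn does, not as a description of its internals.

```r
aa3 <- c("Ala", "Arg", "Asn", "Asp", "Cys", "Glu", "Gln", "Gly", "His",
         "Ile", "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp",
         "Tyr", "Val")
aa1 <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K",
         "M", "F", "P", "S", "T", "W", "Y", "V")
x <- c("p.E13D", "p.F121S")

# Sequential replacement: each pass sees the output of the previous one,
# so the "G" in "Glu" and the "P" in "Phe" are replaced a second time.
sequential <- x
for (i in seq_along(aa1)) {
  sequential <- gsub(aa1[i], aa3[i], sequential, fixed = TRUE)
}

# Single pass: locate every capital letter in the original strings once,
# then swap each match for its 3-letter code.
hits <- gregexpr("[A-Z]", x)
single_pass <- x
regmatches(single_pass, hits) <- lapply(
  regmatches(x, hits),
  function(found) aa3[match(found, aa1)]
)

print(sequential)   # "p.Glylu13Asp" "p.Prohe121Ser"
print(single_pass)  # "p.Glu13Asp"   "p.Phe121Ser"
```

Because the single pass only ever looks at the original text, letters that appear inside an inserted 3-letter code are never matched again.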
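Answer 1 (score: 4)
library(stringr)
# A named vector passed as the pattern is applied one replacement after another,
# so the "G" in "Glu" would later be hit by "G" -> "Gly". Match every capital
# letter in a single pass instead and look its code up in the named vector
# (a function replacement needs a reasonably recent stringr).
codes <- c(
"A"="Ala", "R"="Arg", "N"="Asn", "D"="Asp",
"C"="Cys", "E"="Glu", "Q"="Gln", "G"="Gly",
"H"="His", "I"="Ile", "L"="Leu", "K"="Lys",
"M"="Met", "F"="Phe", "P"="Pro", "S"="Ser",
"T"="Thr", "W"="Trp", "Y"="Tyr", "V"="Val"
)
str_replace_all(x, "[A-Z]", function(m) unname(codes[m]))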
Answer 2 (score: 4)
Here is a base R solution:
ref <- aa3
names(ref) <- aa1
tmp <- do.call(rbind, regmatches(x, regexec("p\\.([A-Z])([0-9]+)([A-Z])", x)))
tmp2 <- apply(tmp[, c(2, 4)], 2, FUN = function(x) ref[x])
paste0("p.", tmp2[, 1], tmp[, 3], tmp2[, 2])
#[1] "p.Gly12Cys" "p.Phe121Ser" "p.Pro124Ser" "p.Pro124Leu" "p.Glu13Asp" "p.Glu203Lys" "p.Gln209Pro" "p.Gln209Pro" "p.Gln209Leu"
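You essentially split each string into its component parts, e.g. "p.Q209L" is split into p., Q, 209 and L. Then you swap the single-letter amino acid codes for their 3-letter versions using the reference vector. Alternatively, with akrun's approach you can drop ref[x] (and the two extra lines that build ref) and use aa3[match(x, aa1)] instead. Finally, paste the pieces back together.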
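The split, look up and paste steps described above can also be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the function name to_three_letter is hypothetical, and returning NA for strings that do not have the form p.<letter><digits><letter>, or that use a letter outside aa1, is an assumption made here.

```r
aa3 <- c("Ala", "Arg", "Asn", "Asp", "Cys", "Glu", "Gln", "Gly", "His",
         "Ile", "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp",
         "Tyr", "Val")
aa1 <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K",
         "M", "F", "P", "S", "T", "W", "Y", "V")

# Hypothetical helper: convert "p.Q209L" style variants to "p.Gln209Leu".
to_three_letter <- function(v) {
  # Each list element is c(full match, ref letter, position, alt letter),
  # or character(0) when the string does not match the pattern.
  parts <- regmatches(v, regexec("^p\\.([A-Z])([0-9]+)([A-Z])$", v))
  vapply(parts, function(p) {
    if (length(p) == 0) return(NA_character_)
    from <- aa3[match(p[2], aa1)]
    to <- aa3[match(p[4], aa1)]
    # Unknown letters give NA rather than a string containing "NA".
    if (is.na(from) || is.na(to)) return(NA_character_)
    paste0("p.", from, p[3], to)
  }, character(1))
}

res <- to_three_letter(c("p.Q209L", "p.E13D", "not a variant"))
print(res)
```

Anchoring the pattern with ^ and $ is a design choice: malformed input shows up as NA instead of being partially converted.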