Reverse complement of a base sequence

Time: 2015-03-09 01:58:33

Tags: r loops genetics

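I'm new to R programming, and I'm trying to write a program that computes the reverse complement of a base sequence. The goal is to design DNA primers. I have a DNA sequence made up of A, T, C and G, where the complement of A is T; T = A; C = G; G = C. I've figured out how to reverse it, but for the complement I can only get an answer for a single base, not for the whole sequence, and I don't know how to combine the reverse and complement functions. Here is my code; I'm completely confused by it. Can someone help me with this? You'd be a lifesaver!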

strReverse <- function(x)
  sapply(lapply(strsplit(x, NULL), rev), paste, collapse="")
strReverse(c("ATCGGTCAATCGA"))

complement.base = function(base){ 
  if(base == 'A' | base ==  'a')   print("T") 
  if(base == 'T' | base == 't')         print("A")
  if(base == 'G' | base == 'g')     print("C")
  if(base == 'C' | base == 'c')     print("G")}
complement.base(base="A")

2 answers:

Answer 0 (score: 2)

You can do this efficiently with Rcpp:

library(Rcpp)
revComp.rcpp <- cppFunction(
"std::string comp(std::string x) {
  const int n = x.length();
  for (int i=0; i < n; ++i) {
    if (x[i] == 'A' || x[i] == 'a')  x[i] = 'T';
    else if (x[i] == 'T' || x[i] == 't')  x[i] = 'A';
    else if (x[i] == 'G' || x[i] == 'g')  x[i] = 'C';
    else  x[i] = 'G';  // any other character, including 'C' and 'c', becomes 'G'
  }
  std::reverse(x.begin(), x.end());
  return x;
}")
revComp.rcpp("ATCGGTCAATCGA")
# [1] "TCGATTGACCGAT"
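
For comparison, here is a minimal plain-R sketch of the same two steps (complement each base, then reverse the result), written as a fix-up of the per-base function from the question. The name `rev_comp_lookup` and the lookup vector are illustrative, not part of the answer above. One deliberate difference, which is an assumption about what you might want: characters outside A/T/G/C (for example `N`) are left unchanged rather than falling into the `else` branch and becoming `G`.

```r
# Hypothetical helper (not from the answer): complement each base through a
# named lookup vector, then reverse the sequence. Handles a single string.
rev_comp_lookup <- function(dna) {
  # Complement for each recognised base; output is upper-case, as above.
  comp <- c(A = "T", a = "T", T = "A", t = "A",
            G = "C", g = "C", C = "G", c = "G")
  # Split the string into single characters.
  bases <- strsplit(dna, NULL)[[1]]
  # Only translate characters that appear in the lookup; keep the rest as-is.
  known <- bases %in% names(comp)
  bases[known] <- comp[bases[known]]
  # Reverse and glue back into one string.
  paste(rev(bases), collapse = "")
}

rev_comp_lookup("ATCGGTCAATCGA")
# [1] "TCGATTGACCGAT"
rev_comp_lookup("ATNG")
# [1] "CNAT"
```

This is much slower than the compiled version on long inputs, but it may be easier to follow for someone new to R.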

This appears to be somewhat faster than the corresponding code from the Biostrings package (tested on a string of 13 million bases):

library(Biostrings)
x <- "ATCGGTCAATCGA"
big.x <- paste(rep(x, 1000000), collapse="")
big.x2 <- DNAString(big.x)
rev.biostr <- function(x) as.character(reverseComplement(x))
all.equal(revComp.rcpp(big.x), as.character(reverseComplement(big.x2)))
# [1] TRUE

library(microbenchmark)
microbenchmark(revComp.rcpp(big.x), as.character(reverseComplement(big.x2)))
# Unit: milliseconds
#                                     expr       min        lq      mean    median        uq      max neval
#                      revComp.rcpp(big.x)  77.21618  78.44534  84.54397  82.21002  87.49367 123.8166   100
#  as.character(reverseComplement(big.x2)) 144.13900 151.12869 170.73765 156.44300 164.41374 399.2948   100

Answer 1 (score: 1)

I would actually consider using chartr from base R, and reversing the result (or the input) with the help of stringi.

myFun <- function(invec) {
  require(stringi)
  invec <- stri_reverse(invec)
  chartr(old = "AaTtGgCc", new = "TTAACCGG", invec)
}

x <- "ATCGGTCAATCGA"
myFun(x)
# [1] "TCGATTGACCGAT"
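
If installing stringi is not an option, the same idea can be sketched with base R only. This is an illustrative variant, not part of the answer: the name `rev_comp_base` is made up here, and the reversal uses the `strsplit`/`rev`/`paste` approach from the question instead of `stri_reverse`. It is likely slower on very long strings, but it works on a whole character vector at once.

```r
# Hypothetical base-R-only variant of myFun (no stringi dependency).
rev_comp_base <- function(invec) {
  # Complement every base in one pass; output is upper-case, as in myFun.
  comp <- chartr(old = "AaTtGgCc", new = "TTAACCGG", invec)
  # Reverse each string: split into characters, reverse, glue back together.
  vapply(strsplit(comp, NULL),
         function(chars) paste(rev(chars), collapse = ""),
         character(1))
}

rev_comp_base(c("ATCGGTCAATCGA", "aacg"))
# [1] "TCGATTGACCGAT" "CGTT"
```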

Using @josilber's sample data, it performs very similarly to his Rcpp approach:

all.equal(myFun(big.x), revComp.rcpp(big.x))
# [1] TRUE

library(microbenchmark)
microbenchmark(myFun(big.x), revComp.rcpp(big.x))
# Unit: milliseconds
#                 expr      min       lq     mean   median       uq      max neval
#         myFun(big.x) 349.5797 352.8197 362.3009 356.4484 362.7197 437.9556   100
#  revComp.rcpp(big.x) 359.5485 363.8615 378.3465 368.3360 386.3734 444.2901   100