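Suppose I have a DNA sequence and I want to get its complement. I used the following code, but it does not give me the result. What am I doing wrong?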
s=readline()
ATCTCGGCGCGCATCGCGTACGCTACTAGC
p=unlist(strsplit(s,""))
h=rep("N",nchar(s))
unlist(lapply(p,function(d){
for b in (1:nchar(s)) {
if (p[b]=="A") h[b]="T"
if (p[b]=="T") h[b]="A"
if (p[b]=="G") h[b]="C"
if (p[b]=="C") h[b]="G"
}
Answer 0 (score: 13)
Use chartr, which is built for this purpose:
> s
[1] "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
> chartr("ATGC","TACG",s)
[1] "TAGAGCCGCGCGTAGCGCATGCGATGATCG"
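Give it two strings of equal length (the characters to replace and their replacements) and then your string. It is also vectorized over the argument being translated: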
> chartr("ATGC","TACG",c("AAAACG","TTTTT"))
[1] "TTTTGC" "AAAAA"
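Note that I am replacing within the string representation of the DNA rather than the vector of single characters. To convert the vector, I would create a lookup map as a named vector and index into it: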
> p
[1] "A" "T" "C" "T" "C" "G" "G" "C" "G" "C" "G" "C" "A" "T" "C" "G" "C" "G" "T"
[20] "A" "C" "G" "C" "T" "A" "C" "T" "A" "G" "C"
> map=c("A"="T", "T"="A","G"="C","C"="G")
> unname(map[p])
[1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A"
[20] "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
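One caveat with the named-vector lookup above: any character that is not a name in the map (an ambiguous base such as "N", for example) comes back as NA. A minimal sketch of one way to handle that, assuming you want unmapped bases reported as "N"; the helper name complement_vec is hypothetical and not part of the answer:

```r
# Hypothetical helper: complement a vector of single bases,
# replacing anything outside A/T/G/C with "N" (an assumption, adjust as needed).
complement_vec <- function(bases) {
  map <- c(A = "T", T = "A", G = "C", C = "G")
  out <- unname(map[bases])   # unmapped bases become NA here
  out[is.na(out)] <- "N"      # fall back to "N" instead of NA
  out
}

complement_vec(c("A", "T", "N", "G"))
# [1] "T" "A" "N" "C"
```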
Answer 1 (score: 11)
The Bioconductor package Biostrings provides many useful functions for this kind of operation. Install it once:
# biocLite() is defunct in current Bioconductor releases; BiocManager replaces it
install.packages("BiocManager")
BiocManager::install("Biostrings")
Then use:
library(Biostrings)
dna = DNAStringSet(c("ATCTCGGCGCGCATCGCGTACGCTACTAGC", "ACCGCTA"))
complement(dna)
Answer 2 (score: 5)
sapply(p, switch, "A"="T", "T"="A","G"="C","C"="G")
A T C T C G G C G C G C A T C G C G T
"T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A"
A C G C T A C T A G C
"T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
If you do not want the names on the complement, you can always remove them with unname:
unname(sapply(p, switch, "A"="T", "T"="A","G"="C","C"="G") )
[1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C"
[19] "A" "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
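If the input may contain characters other than A, T, G and C, switch returns NULL for them and sapply would then typically hand back a list rather than a character vector. A minimal sketch that adds a default branch and uses vapply to enforce the result type; the function name complement_switch and the choice of "N" as the fallback are assumptions, not part of the answer:

```r
# Hypothetical helper: switch-based complement with a default branch.
complement_switch <- function(bases) {
  vapply(
    bases,
    # The final unnamed argument is the default used when nothing matches.
    function(b) switch(b, A = "T", T = "A", G = "C", C = "G", "N"),
    character(1),
    USE.NAMES = FALSE   # drop the names so unname() is not needed
  )
}

p <- unlist(strsplit("ATCGN", ""))
complement_switch(p)
# [1] "T" "A" "G" "C" "N"
```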
Answer 3 (score: 5)
There is also the seqinr package:
library(seqinr)
seq <- s2c("ATCTCGGCGCGCATCGCGTACGCTACTAGC") # comp() expects a vector of single characters
comp(seq) # gives complement
rev(comp(seq)) # gives the reverse complement
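Biostrings has a smaller memory footprint, but seqinr is also nice because you can choose the case of the bases (including mixed case) and change them to whatever you want, for example if you want to mix T and U in the same sequence. Biostrings forces you to have either T or U.
Answer 4 (score: 5)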
To complement, regardless of whether the bases are upper or lower case, you can use chartr():
n <- "ACCTGccatGCATC"
chartr("acgtACGT", "tgcaTGCA", n)
# [1] "TGGACggtaCGTAG"
To go a step further and reverse-complement the nucleotide sequence, you would additionally reverse the order of the characters in the result.
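Answer 5 (score: 0)
An answer using base R. It is written in a somewhat awful format to make things clear while still keeping it a one-liner. It supports both upper and lower case.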
revc = function(s){
paste0(
rev(
unlist(
strsplit(
chartr("ATGCatgc","TACGtacg",s)
, "") # from strsplit
) # from unlist
) # from rev
, collapse='') # from paste0
}
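Because revc calls unlist() on the result of strsplit(), passing it a vector of several sequences appears to merge them into one string. A minimal sketch of a variant that handles each element separately, using only base R; the name revc_vec is hypothetical:

```r
# Hypothetical vectorized variant: reverse-complement each string on its own.
revc_vec <- function(x) {
  comp <- chartr("ATGCatgc", "TACGtacg", x)   # complement, case preserved
  vapply(
    strsplit(comp, ""),                        # one character vector per input string
    function(ch) paste(rev(ch), collapse = ""),
    character(1)
  )
}

revc_vec(c("ATCTCG", "atgcat", "TTTCGC"))
# [1] "CGAGAT" "atgcat" "GCGAAA"
```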
Answer 6 (score: 0)
I have generalised the rev(comp(seq)) solution from the seqinr package:
install.packages("devtools")
devtools::install_github("TomKellyGenetics/tktools")
tktools::revcomp(seq)
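This version is compatible with string input and is vectorised to handle a list or vector of multiple strings. The output class should match the input, including case and type. It also supports RNA, both as input containing "U" and as RNA output sequences.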
> seq <- "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
> revcomp(seq)
[1] "GCTAGTAGCGTACGCGATGCGCGCCGAGAT"
> seq <- c("TATAAT", "TTTCGC", "atgcat")
> revcomp(seq)
TATAAT TTTCGC atgcat
"ATTATA" "GCGAAA" "atgcat"
See the manual or the TomKellyGenetics/tktools GitHub package repository.