我正在尝试计算给定DNA序列中的密码子频率。
例如:
sequence = 'ATGAAGAAA'
codons = ['ATG', 'AAG', 'AAA']
对于密码子中的XX,:
frequency = codons.count(XX)/(codons.count(XX)+codons.count(XX2)+codons.count(XX3))
请注意,XX2和XX3并不总是在序列中。一些密码子可能有也可能没有多个密码子。
示例:赖氨酸具有2个密码子,AAA和AAG
所以频率
AAA = codons.count('AAA')/(codons.count('AAA') + codons.count('AAG'))
如何为列表中的每个密码子执行此操作?我如何解释多个密码子?
答案 0 :(得分:6)
使用defaultdict
from collections import defaultdict
mydict = defaultdict(int)
for aa in mysecuence:
mydict[aa] +=1
这适用于氨基酸(蛋白质) 对于密码子,您应该以3个位置步骤迭代序列以获取defaultdict的键。例如:
>>> mysec = "GAUCACTUGCCA"
>>> a = [mysec[i:i+3] for i in range(0,len(mysec), 3)]
>>> print a
['GAU', 'CAC', 'TUG', 'CCA']
编辑:如果您想计算变性,您应该准备一个字典,将每个密码子(密钥)与其简并密码子(值,密码子列表)相关联。要计算经常性, 从defaultdict中,您可以获得每个密码子的计数,然后对于每个密码子,您计算从上面指出的密码子字典中读取的简并密码子的计数总和。然后你可以计算出频率。
编辑2:这里有一个真实的例子:
from collections import defaultdict
#the first 600 nucleotides from GenBank: AAHX01097212.1
rna = ("tcccccgcagcttcgggaacgtgcgggctcgggagggaggggcctggcgccgggcgcgcg"
"cctgcgccccaccccgccccaccctggcgggtctcgcgcgcccggcccgcctcctgtcaa"
"ccccagcgcggcggtcaggtggtccccagcccttggccccagcctccagcttcctggtcc"
"ctcgggctctgagtcctgtctccggcagatcgcctttctgattgttctcctgcgcagctg"
"gaggtgtatagcccctagccgagctatggtgcctcagcagatgtgaggaggtagtgggtc"
"aggataaacccgcgcactccataataacgtgccagggctcagtgacttgggtctgcatta")
seq = rna.upper().replace('T', 'U')
#RNA codon table from http://en.wikipedia.org/wiki/Genetic_code
degenerated = (('GCU', 'GCC', 'GCA', 'GCG'),
('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'),
('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
('AAA', 'AAG'), ('AAU', 'AAC'), ('GAU', 'GAC'),
('UUU', 'UUC'), ('UGU', 'UGC'), ('CCU', 'CCC', 'CCA', 'CCG'),
('CAA', 'CAG'), ('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'),
('GAA', 'GAG'), ('ACU', 'ACC', 'ACA', 'ACG'),
('GGU', 'GGC', 'GGA', 'GGG'), ('CAU', 'CAC'), ('UAU', 'UAC'),
('AUU', 'AUC', 'AUA'), ('GUU', 'GUC', 'GUA', 'GUG'),
('UAA', 'UGA', 'UAG'))
#prepare the dictio of degenerated codons
degen_dict = {}
for codons in degenerated:
for codon in codons:
degen_dict[codon] = codons
#query_codons
max_seq = len(seq)
query_codons = [seq[i:i+3] for i in range(0, max_seq, 3)]
#prepare dictio of counts:
counts = defaultdict(int)
for codon in query_codons:
counts[codon] +=1
#actual calculation of frecuencies
data = {}
for codon in query_codons:
if codon in degen_dict:
totals = sum(counts[deg] for deg in degen_dict[codon])
frecuency = float(counts[codon]) / totals
else:
frecuency = 1.00
data[codon] = frecuency
#print results
for codon, frecuency in data.iteritems():
print "%s -> %.2f" %(codon, frecuency)
#produces:
GUC -> 0.57
AUA -> 1.00
ACG -> 0.50
AAC -> 1.00
CCU -> 0.25
UAU -> 1.00
..........
GCU -> 0.19
GAU -> 1.00
UAG -> 0.33
CUC -> 0.38
UUA -> 0.13
UGA -> 0.33
答案 1 :(得分:2)
如果您的序列在正确的阅读框中:
>>> import collections
>>>
>>> code = { 'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
... 'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
... 'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
... 'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',
... 'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
... 'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
... 'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
... 'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',
... 'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
... 'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
... 'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
... 'atg': 'M', 'acg': 'T', 'aag': 'K', 'agg': 'R',
... 'gtt': 'V', 'gct': 'A', 'gat': 'D', 'ggt': 'G',
... 'gtc': 'V', 'gcc': 'A', 'gac': 'D', 'ggc': 'G',
... 'gta': 'V', 'gca': 'A', 'gaa': 'E', 'gga': 'G',
... 'gtg': 'V', 'gcg': 'A', 'gag': 'E', 'ggg': 'G'
... }
>>> def count_codons(cds):
... counts = collections.defaultdict(int)
... for i in range(0,len(cds),3):
... codon = cds[i:i+3]
... counts[codon] += 1
... return counts
...
>>> def translate(cds, code):
... return "".join((code[cds[i:i+3]] for i in range(0, len(cds), 3)))
...
>>> seq = 'ATGAAGAAA'
>>>
>>> codon_counts = count_codons(seq.lower())
>>> trans_seq = translate(seq.lower(), code)
>>>
>>> [(codon, code[codon], float(codon_counts[codon])/trans_seq.count(code[codon])) for codon in codon_counts.keys()]
[('atg', 'M', 1.0), ('aag', 'K', 0.5), ('aaa', 'K', 0.5)]
>>>
其他信息:
我认为你要求找到一种叫做密码子使用的东西。
有在线工具可以让您找到密码子使用情况。这个也允许离线使用。
http://www.bioinformatics.org/sms2/codon_usage.html
和结果(在'分数'中是你要求的):
Results for 9 residue sequence "sample sequence one" starting "ATGAAGAAA"
AmAcid Codon Number /1000 Fraction ..
Ala GCG 0.00 0.00 0.00
Ala GCA 0.00 0.00 0.00
Ala GCT 0.00 0.00 0.00
Ala GCC 0.00 0.00 0.00
Cys TGT 0.00 0.00 0.00
Cys TGC 0.00 0.00 0.00
Asp GAT 0.00 0.00 0.00
Asp GAC 0.00 0.00 0.00
Glu GAG 0.00 0.00 0.00
Glu GAA 0.00 0.00 0.00
Phe TTT 0.00 0.00 0.00
Phe TTC 0.00 0.00 0.00
Gly GGG 0.00 0.00 0.00
Gly GGA 0.00 0.00 0.00
Gly GGT 0.00 0.00 0.00
Gly GGC 0.00 0.00 0.00
His CAT 0.00 0.00 0.00
His CAC 0.00 0.00 0.00
Ile ATA 0.00 0.00 0.00
Ile ATT 0.00 0.00 0.00
Ile ATC 0.00 0.00 0.00
Lys AAG 1.00 333.33 0.50
Lys AAA 1.00 333.33 0.50
Leu TTG 0.00 0.00 0.00
Leu TTA 0.00 0.00 0.00
Leu CTG 0.00 0.00 0.00
Leu CTA 0.00 0.00 0.00
Leu CTT 0.00 0.00 0.00
Leu CTC 0.00 0.00 0.00
Met ATG 1.00 333.33 1.00
Asn AAT 0.00 0.00 0.00
Asn AAC 0.00 0.00 0.00
Pro CCG 0.00 0.00 0.00
Pro CCA 0.00 0.00 0.00
Pro CCT 0.00 0.00 0.00
Pro CCC 0.00 0.00 0.00
Gln CAG 0.00 0.00 0.00
Gln CAA 0.00 0.00 0.00
Arg AGG 0.00 0.00 0.00
Arg AGA 0.00 0.00 0.00
Arg CGG 0.00 0.00 0.00
Arg CGA 0.00 0.00 0.00
Arg CGT 0.00 0.00 0.00
Arg CGC 0.00 0.00 0.00
Ser AGT 0.00 0.00 0.00
Ser AGC 0.00 0.00 0.00
Ser TCG 0.00 0.00 0.00
Ser TCA 0.00 0.00 0.00
Ser TCT 0.00 0.00 0.00
Ser TCC 0.00 0.00 0.00
Thr ACG 0.00 0.00 0.00
Thr ACA 0.00 0.00 0.00
Thr ACT 0.00 0.00 0.00
Thr ACC 0.00 0.00 0.00
Val GTG 0.00 0.00 0.00
Val GTA 0.00 0.00 0.00
Val GTT 0.00 0.00 0.00
Val GTC 0.00 0.00 0.00
Trp TGG 0.00 0.00 0.00
Tyr TAT 0.00 0.00 0.00
Tyr TAC 0.00 0.00 0.00
End TGA 0.00 0.00 0.00
End TAG 0.00 0.00 0.00
End TAA 0.00 0.00 0.00
cusp是EMBOSS的密码子使用工具,也值得一看。
您可能想要查看BioPython以处理生物序列。我相信他们有一个密码子使用模块。
答案 2 :(得分:1)
PLY是一个解析器模块,具有一些很好的调试功能;它非常擅长这样的任务......
from ply import lex
tokens = (
'CODON',
)
t_CODON = (
r"ATG|"
r"AAG|"
r"AAF|"
r"AAC|"
r"BOB|"
r"FOO|"
r"BAR|"
r"AAA"
)
def t_error(t):
raise TypeError("Unknown codon '%s'" % (t.value,))
lex.lex()
sequence = "AAABOBAACAAAFOOAACBARAAAAAA"
ccount = dict()
total = 0.0
lex.input(sequence)
for tok in iter(lex.token, None):
if ccount.get(tok.value, False):
ccount[tok.value] += 1
else:
ccount[tok.value] = 1
total += 1.0
for codon,count in ccount.items():
print "Frequency of %s is %f" % (codon, count/total)
运行该代码会产生......
[mpenning@Bucksnort ~]$ python codon.py
Frequency of BAR is 0.111111
Frequency of BOB is 0.111111
Frequency of FOO is 0.111111
Frequency of AAA is 0.444444
Frequency of AAC is 0.222222
当你开始介绍化学术语时,我有点迷失,但你可以从这里接管......
答案 3 :(得分:1)
包含所有64个密码子的密码子表,甚至是非去变性密码子(它们构成一个元素组)
在迭代过程中计算密码子出现的同时计算每个密码子组的出现次数
包含编码氨基酸名称的密码子表 - >一个很好的展示
代码:
from collections import defaultdict
# the first 600 nucleotides from GenBank: AAHX01097212.1
adn = ("tcccccgcagcttcgggaacgtgcgggctcgggagggaggggcctggcgccgggcgcgcg"
"cctgcgccccaccccgccccaccctggcgggtctcgcgcgcccggcccgcctcctgtcaa"
"ccccagcgcggcggtcaggtggtccccagcccttggccccagcctccagcttcctggtcc"
"ctcgggctctgagtcctgtctccggcagatcgcctttctgattgttctcctgcgcagctg"
"gaggtgtatagcccctagccgagctatggtgcctcagcagatgtgaggaggtagtgggtc"
"aggataaacccgcgcactccataataacgtgccagggctcagtgacttgggtctgcatta")
arn = adn.upper().replace('T','U')
#RNA codon table from http://en.wikipedia.org/wiki/Genetic_code
codon_table = ((('GCU', 'GCC', 'GCA', 'GCG'), 'Alanine'),
(('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'), 'Leucine'),
(('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'), 'Arginine'),
(('AAA', 'AAG'), 'Lysine'),
(('AAU', 'AAC'), 'Asparagine'),
(('AUG',), 'Methionine'),
(('GAU', 'GAC'), 'Aspartic acid' ),
(('UUU', 'UUC'), 'Phenylalanine'),
(('UGU', 'UGC'), 'Cysteine'),
(('CCU', 'CCC', 'CCA', 'CCG'), 'Proline') ,
(('CAA', 'CAG'), 'Glutamine'),
(('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'), 'Serine'),
(('GAA', 'GAG'), 'Glutamic acid'),
(('ACU', 'ACC', 'ACA', 'ACG'), 'Threonine'),
(('GGU', 'GGC', 'GGA', 'GGG'), 'Glycine'),
(('UGG',), 'Tryptophane'),
(('CAU', 'CAC'), 'Histidine'),
(('UAU', 'UAC'), 'Tyrosine'),
(('AUU', 'AUC', 'AUA'), 'Isoleucine'),
(('GUU', 'GUC', 'GUA', 'GUG'), 'Valine'),
(('UAA', 'UGA', 'UAG'), 'STOP') )
siblings = dict( (cod, codgroup) for codgroup,aa in codon_table for cod in codgroup )
cod_count, grp_count, freq = defaultdict(int), defaultdict(int), {}
for cod in (arn[i:i+3] for i in xrange(0,len(arn),3)):
cod_count[cod] += 1
grp_count[siblings[cod]] += 1
for cod in siblings.iterkeys(): # the keys of siblings are the 64 codons
if siblings[cod] in grp_count:
freq[cod] = float(cod_count[cod])/grp_count[siblings[cod]]
else:
freq[cod] = '-* Missing *-'
display = '\n'.join(aa.rjust(13)+\
'\n'.join('%s %-16s' % (cod.rjust(18 if i else 5),freq[cod])
for i,cod in enumerate(codgrp))
for codgrp,aa in codon_table)
# editing addition:
def outputResults(filename,arn,codon_table,displ):
li = ['This file is named %s' % filename]
li.append('The sequence of ARN:\n%s' %\
'\n'.join(arn[i:i+42] for i in xrange(0,len(arn),42)))
li.append('Size of the sequence : '+str(len(arn)))
li.append('Codon_table:\n'+\
'\n'.join('%s : %s' % (u,v) for u,v in codon_table))
li.append('Frequency results :\n'+displ)
with open(filename,'w') as f:
f.writelines('\n\n'.join(li))
outputResults('ARN_mem.txt',arn,codon_table,display)
print display
我添加了一个函数outputResults()来显示记录数据和结果的方式
生成的文件内容为:
This file is named ARN_mem.txt
The sequence of ARN:
UCCCCCGCAGCUUCGGGAACGUGCGGGCUCGGGAGGGAGGGG
CCUGGCGCCGGGCGCGCGCCUGCGCCCCACCCCGCCCCACCC
UGGCGGGUCUCGCGCGCCCGGCCCGCCUCCUGUCAACCCCAG
CGCGGCGGUCAGGUGGUCCCCAGCCCUUGGCCCCAGCCUCCA
GCUUCCUGGUCCCUCGGGCUCUGAGUCCUGUCUCCGGCAGAU
CGCCUUUCUGAUUGUUCUCCUGCGCAGCUGGAGGUGUAUAGC
CCCUAGCCGAGCUAUGGUGCCUCAGCAGAUGUGAGGAGGUAG
UGGGUCAGGAUAAACCCGCGCACUCCAUAAUAACGUGCCAGG
GCUCAGUGACUUGGGUCUGCAUUA
Size of the sequence : 360
Codon_table:
('GCU', 'GCC', 'GCA', 'GCG') : Alanine
('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG') : Leucine
('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG') : Arginine
('AAA', 'AAG') : Lysine
('AAU', 'AAC') : Asparagine
('AUG',) : Methionine
('GAU', 'GAC') : Aspartic acid
('UUU', 'UUC') : Phenylalanine
('UGU', 'UGC') : Cysteine
('CCU', 'CCC', 'CCA', 'CCG') : Proline
('CAA', 'CAG') : Glutamine
('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC') : Serine
('GAA', 'GAG') : Glutamic acid
('ACU', 'ACC', 'ACA', 'ACG') : Threonine
('GGU', 'GGC', 'GGA', 'GGG') : Glycine
('UGG',) : Tryptophane
('CAU', 'CAC') : Histidine
('UAU', 'UAC') : Tyrosine
('AUU', 'AUC', 'AUA') : Isoleucine
('GUU', 'GUC', 'GUA', 'GUG') : Valine
('UAA', 'UGA', 'UAG') : STOP
Frequency results :
Alanine GCU 0.1875
GCC 0.375
GCA 0.25
GCG 0.1875
Leucine UUA 0.125
UUG 0.0
CUU 0.25
CUC 0.375
etc.............
答案 4 :(得分:0)
我不确定我是否完全理解了这个问题,但我认为你需要将计算分为两个阶段:首先计算每个密码子出现的次数,然后计算出频率。我想出了以下代码:
from collections import defaultdict
# Initial sequence.
sequence = "AAABOBAACAAAFOOAACBARAAAAAA"
# Which codons are grouped together.
groups = (
('AAA', 'AAC'),
('BOB',),
('FOO', 'BAR', 'BAA'),
)
# Separate into list of codons.
codonList = []
for codons in range(0, len(sequence), 3):
codonList.append(sequence[codons:codons+3])
# Count how many times each codon is used.
counts = defaultdict(int)
for codon in codonList:
counts[codon] += 1
# Go through and calculate frequencies of each codon.
freqs = {}
for group in groups:
total = float(sum(counts[codon] for codon in group))
for codon in group:
freqs[codon] = counts[codon] / total
# Done.
print freqs
注意在最后一个循环中将total
显式转换为浮点数。如果它保留为整数,则Python 2.x上的后续除法将为0或1,因此我们需要转换它以获得浮点输出。我得到的输出是:
blair@blair-eeepc:~$ python codons.py
{'BAR': 0.5, 'AAC': 0.33333333333333331, 'BAA': 0.0, 'AAA': 0.66666666666666663, 'BOB': 1.0, 'FOO': 0.5}
这是你想要的那种输出吗?