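Defining a function to calculate the relative frequency of amino acids

Time: 2011-04-25 00:13:50

Tags: python bioinformatics

I am trying to calculate the codon frequencies in a given DNA sequence.

For example: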

sequence = 'ATGAAGAAA'
codons = ['ATG', 'AAG', 'AAA']
for XX in codons,

frequency  = codons.count(XX)/(codons.count(XX)+codons.count(XX2)+codons.count(XX3))

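Note that XX2 and XX3 will not always be in the sequence. Some amino acids are encoded by several codons, others by only one.

Example: lysine has 2 codons, AAA and AAG

So the frequency of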

AAA = codons.count('AAA')/(codons.count('AAA') + codons.count('AAG'))

How can I do this for every codon in the list? How do I account for multiple codons?

5 answers:

Answer 0: (score: 6)

Use a defaultdict

from collections import defaultdict

mydict = defaultdict(int)

for aa in mysecuence:  # mysecuence: the sequence to tally, one symbol at a time
    mydict[aa] += 1

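This works for amino acids (proteins). For codons, you should iterate over the sequence in steps of 3 positions to get the keys of the defaultdict. For example: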

>>> mysec = "GAUCACTUGCCA"
>>> a = [mysec[i:i+3] for i in range(0,len(mysec), 3)]
>>> print a
['GAU', 'CAC', 'TUG', 'CCA']
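
Putting the two snippets above together, a minimal sketch might look like the following. It is written for Python 3; the helper name `tally_codons` and the decision to drop an incomplete trailing codon are assumptions, not part of the original answer.

```python
from collections import defaultdict


def tally_codons(seq):
    """Count non-overlapping codons in seq, reading from position 0."""
    counts = defaultdict(int)
    # Stop before any incomplete trailing codon
    # (assumption: leftover bases are ignored rather than counted).
    for i in range(0, len(seq) - len(seq) % 3, 3):
        counts[seq[i:i + 3]] += 1
    return counts


codon_counts = tally_codons("ATGAAGAAAAAG")
print(dict(codon_counts))
```

Dropping the leftover bases keeps one- or two-letter fragments out of the dictionary keys; whether that is the right choice depends on how the sequence was obtained.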

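EDIT: If you want to account for degeneracy, you should prepare a dictionary that maps each codon (key) to its degenerate codons (value, a list of codons). To calculate the frequency, take the count of each codon from the defaultdict; then, for each codon, sum the counts of its degenerate codons as read from the codon dictionary described above. From those two numbers you can calculate the frequency.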
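
As a rough illustration of that recipe on the sequence from the question (Python 3; the two-group table `synonymous` is a deliberately tiny, hypothetical stand-in for a full codon table):

```python
from collections import defaultdict

sequence = "ATGAAGAAA"
codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]

# Hypothetical, truncated table: only the groups this sequence needs.
synonymous = (("AAA", "AAG"), ("ATG",))

# Map each codon (key) to its whole group of degenerate codons (value).
group_of = {codon: group for group in synonymous for codon in group}

counts = defaultdict(int)
for codon in codons:
    counts[codon] += 1

# Frequency = count of the codon / summed counts of its group.
frequency = {}
for codon in set(codons):
    group_total = sum(counts[c] for c in group_of[codon])
    frequency[codon] = counts[codon] / group_total

print(frequency)
```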

EDIT 2: Here is a real example:

from collections import defaultdict

# the first 360 nucleotides from GenBank: AAHX01097212.1
rna = ("tcccccgcagcttcgggaacgtgcgggctcgggagggaggggcctggcgccgggcgcgcg"
       "cctgcgccccaccccgccccaccctggcgggtctcgcgcgcccggcccgcctcctgtcaa"
       "ccccagcgcggcggtcaggtggtccccagcccttggccccagcctccagcttcctggtcc"
       "ctcgggctctgagtcctgtctccggcagatcgcctttctgattgttctcctgcgcagctg"
       "gaggtgtatagcccctagccgagctatggtgcctcagcagatgtgaggaggtagtgggtc"
       "aggataaacccgcgcactccataataacgtgccagggctcagtgacttgggtctgcatta")

seq = rna.upper().replace('T', 'U')

#RNA codon table from http://en.wikipedia.org/wiki/Genetic_code
degenerated = (('GCU', 'GCC', 'GCA', 'GCG'),
               ('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'),
               ('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
               ('AAA', 'AAG'), ('AAU', 'AAC'), ('GAU', 'GAC'),
               ('UUU', 'UUC'), ('UGU', 'UGC'), ('CCU', 'CCC', 'CCA', 'CCG'),
               ('CAA', 'CAG'), ('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'),
               ('GAA', 'GAG'), ('ACU', 'ACC', 'ACA', 'ACG'),
               ('GGU', 'GGC', 'GGA', 'GGG'), ('CAU', 'CAC'), ('UAU', 'UAC'),
               ('AUU', 'AUC', 'AUA'), ('GUU', 'GUC', 'GUA', 'GUG'),
               ('UAA', 'UGA', 'UAG'))

# prepare the dictionary of degenerate codons
degen_dict = {}
for codons in degenerated:
    for codon in codons:
        degen_dict[codon] = codons

# split the sequence into query codons
max_seq = len(seq)
query_codons = [seq[i:i+3] for i in range(0, max_seq, 3)]

# prepare the dictionary of counts
counts = defaultdict(int)
for codon in query_codons:
    counts[codon] +=1

# actual calculation of frequencies
data = {}
for codon in query_codons:
    if codon in  degen_dict:
        totals = sum(counts[deg] for deg in degen_dict[codon])
        frecuency = float(counts[codon]) / totals
    else:
        frecuency = 1.00

    data[codon] = frecuency

#print results
for codon, frecuency in data.iteritems():
    print "%s  -> %.2f" %(codon, frecuency)


#produces:
GUC  -> 0.57
AUA  -> 1.00
ACG  -> 0.50
AAC  -> 1.00
CCU  -> 0.25
UAU  -> 1.00
..........
GCU  -> 0.19
GAU  -> 1.00
UAG  -> 0.33
CUC  -> 0.38
UUA  -> 0.13
UGA  -> 0.33

Answer 1: (score: 2)

If your sequence is in the correct reading frame:

>>> import collections
>>> 
>>> code = {     'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
...              'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
...              'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
...              'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',
...              'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
...              'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
...              'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
...              'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',
...              'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
...              'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
...              'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
...              'atg': 'M', 'acg': 'T', 'aag': 'K', 'agg': 'R',
...              'gtt': 'V', 'gct': 'A', 'gat': 'D', 'ggt': 'G',
...              'gtc': 'V', 'gcc': 'A', 'gac': 'D', 'ggc': 'G',
...              'gta': 'V', 'gca': 'A', 'gaa': 'E', 'gga': 'G',
...              'gtg': 'V', 'gcg': 'A', 'gag': 'E', 'ggg': 'G'
...         }
>>> def count_codons(cds):
...     counts = collections.defaultdict(int)
...     for i in range(0,len(cds),3):
...        codon = cds[i:i+3]
...        counts[codon] += 1
...     return counts
... 
>>> def translate(cds, code):
...     return "".join((code[cds[i:i+3]] for i in range(0, len(cds), 3)))
... 
>>> seq = 'ATGAAGAAA'
>>> 
>>> codon_counts = count_codons(seq.lower())
>>> trans_seq = translate(seq.lower(), code)
>>> 
>>> [(codon, code[codon], float(codon_counts[codon])/trans_seq.count(code[codon])) for codon in codon_counts.keys()]
[('atg', 'M', 1.0), ('aag', 'K', 0.5), ('aaa', 'K', 0.5)]
>>> 
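
Since the question asks for a function, the session above could be wrapped up along these lines. This is only a sketch for Python 3: the name `relative_codon_frequencies` is made up here, and the codon table is rebuilt from the usual compact TCAG-ordered string for the standard genetic code instead of being typed out in full.

```python
from collections import Counter
from itertools import product

# Standard genetic code; codons run ttt, ttc, tta, ttg, tct, ... (bases in TCAG order).
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(triplet): aa
        for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def relative_codon_frequencies(cds):
    """Return {codon: (amino_acid, fraction)} for an in-frame coding sequence.

    Hypothetical helper: assumes len(cds) is a multiple of 3 and that every
    codon uses only the letters a, c, g, t (case-insensitive).
    """
    cds = cds.lower()
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    codon_counts = Counter(codons)
    aa_counts = Counter(CODE[codon] for codon in codons)
    return {codon: (CODE[codon], n / aa_counts[CODE[codon]])
            for codon, n in codon_counts.items()}


result = relative_codon_frequencies("ATGAAGAAA")
print(result)
```

Counting amino acids directly from the codon list avoids building the translated string first, but the result is the same ratio as in the session above.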

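Additional information:

I think what you are asking for is called codon usage.

There are online tools that let you find codon usage. This one can also be used offline.

http://www.bioinformatics.org/sms2/codon_usage.html

And the results (the 'Fraction' column is what you are asking for):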

Results for 9 residue sequence "sample sequence one" starting "ATGAAGAAA"
AmAcid   Codon     Number        /1000     Fraction   .. 

Ala      GCG         0.00         0.00         0.00 
Ala      GCA         0.00         0.00         0.00 
Ala      GCT         0.00         0.00         0.00 
Ala      GCC         0.00         0.00         0.00 

Cys      TGT         0.00         0.00         0.00 
Cys      TGC         0.00         0.00         0.00 

Asp      GAT         0.00         0.00         0.00 
Asp      GAC         0.00         0.00         0.00 

Glu      GAG         0.00         0.00         0.00 
Glu      GAA         0.00         0.00         0.00 

Phe      TTT         0.00         0.00         0.00 
Phe      TTC         0.00         0.00         0.00 

Gly      GGG         0.00         0.00         0.00 
Gly      GGA         0.00         0.00         0.00 
Gly      GGT         0.00         0.00         0.00 
Gly      GGC         0.00         0.00         0.00 

His      CAT         0.00         0.00         0.00 
His      CAC         0.00         0.00         0.00 

Ile      ATA         0.00         0.00         0.00 
Ile      ATT         0.00         0.00         0.00 
Ile      ATC         0.00         0.00         0.00 

Lys      AAG         1.00       333.33         0.50 
Lys      AAA         1.00       333.33         0.50 

Leu      TTG         0.00         0.00         0.00 
Leu      TTA         0.00         0.00         0.00 
Leu      CTG         0.00         0.00         0.00 
Leu      CTA         0.00         0.00         0.00 
Leu      CTT         0.00         0.00         0.00 
Leu      CTC         0.00         0.00         0.00 

Met      ATG         1.00       333.33         1.00 

Asn      AAT         0.00         0.00         0.00 
Asn      AAC         0.00         0.00         0.00 

Pro      CCG         0.00         0.00         0.00 
Pro      CCA         0.00         0.00         0.00 
Pro      CCT         0.00         0.00         0.00 
Pro      CCC         0.00         0.00         0.00 

Gln      CAG         0.00         0.00         0.00 
Gln      CAA         0.00         0.00         0.00 

Arg      AGG         0.00         0.00         0.00 
Arg      AGA         0.00         0.00         0.00 
Arg      CGG         0.00         0.00         0.00 
Arg      CGA         0.00         0.00         0.00 
Arg      CGT         0.00         0.00         0.00 
Arg      CGC         0.00         0.00         0.00 

Ser      AGT         0.00         0.00         0.00 
Ser      AGC         0.00         0.00         0.00 
Ser      TCG         0.00         0.00         0.00 
Ser      TCA         0.00         0.00         0.00 
Ser      TCT         0.00         0.00         0.00 
Ser      TCC         0.00         0.00         0.00 

Thr      ACG         0.00         0.00         0.00 
Thr      ACA         0.00         0.00         0.00 
Thr      ACT         0.00         0.00         0.00 
Thr      ACC         0.00         0.00         0.00 

Val      GTG         0.00         0.00         0.00 
Val      GTA         0.00         0.00         0.00 
Val      GTT         0.00         0.00         0.00 
Val      GTC         0.00         0.00         0.00 

Trp      TGG         0.00         0.00         0.00 

Tyr      TAT         0.00         0.00         0.00 
Tyr      TAC         0.00         0.00         0.00 

End      TGA         0.00         0.00         0.00 
End      TAG         0.00         0.00         0.00 
End      TAA         0.00         0.00         0.00 
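
The three numeric columns can be reproduced in a few lines. The sketch below (Python 3) follows one reading of the column meanings taken from the table itself, not the tool's actual code: Number is the raw codon count, /1000 is that count per thousand codons, and Fraction is the codon's share within its amino acid. The small `AMINO_ACID` mapping is a hypothetical excerpt covering only the codons in the sample.

```python
from collections import Counter

seq = "ATGAAGAAA"
codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]

# Hypothetical excerpt of the codon table: just the codons in this sample.
AMINO_ACID = {"ATG": "Met", "AAG": "Lys", "AAA": "Lys"}

number = Counter(codons)
per_amino_acid = Counter(AMINO_ACID[c] for c in codons)

rows = []
for codon in sorted(number):
    per_thousand = 1000.0 * number[codon] / len(codons)
    fraction = number[codon] / per_amino_acid[AMINO_ACID[codon]]
    rows.append((AMINO_ACID[codon], codon, number[codon], per_thousand, fraction))

for row in rows:
    print("%-8s %-6s %8.2f %12.2f %12.2f" % row)
```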

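cusp is the codon usage tool from EMBOSS, and is also worth a look.

You may want to check out BioPython for working with biological sequences. I believe they have a codon usage module.

Answer 2: (score: 1)

PLY is a parsing module with some nice debugging features; it is very well suited to tasks like this...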

from ply import lex

tokens = (
    'CODON',
)
t_CODON = (
    r"ATG|"
    r"AAG|"
    r"AAF|"
    r"AAC|"
    r"BOB|"
    r"FOO|"
    r"BAR|"
    r"AAA"
)
def t_error(t):
    raise TypeError("Unknown codon '%s'" % (t.value,))
lex.lex()
sequence = "AAABOBAACAAAFOOAACBARAAAAAA"
ccount = dict()
total = 0.0
lex.input(sequence)
for tok in iter(lex.token, None):
    if ccount.get(tok.value, False):
        ccount[tok.value] += 1
    else:
        ccount[tok.value] = 1
    total += 1.0

for codon,count in ccount.items():
    print "Frequency of %s is %f" % (codon, count/total)

Running that code produces...

[mpenning@Bucksnort ~]$ python codon.py
Frequency of BAR is 0.111111
Frequency of BOB is 0.111111
Frequency of FOO is 0.111111
Frequency of AAA is 0.444444
Frequency of AAC is 0.222222

I got a bit lost when you started introducing the chemistry terminology, but you can take it from here...
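
For readers without PLY installed, roughly the same tokenize-and-count idea can be sketched with the standard library's `re` module (Python 3). This swaps PLY for a plain regular expression, so it is an alternative and not the approach shown above; the function name `codon_frequencies` and the fixed `KNOWN` tuple are assumptions.

```python
import re
from collections import Counter

# Same toy vocabulary as the lexer above.
KNOWN = ("ATG", "AAG", "AAF", "AAC", "BOB", "FOO", "BAR", "AAA")
CODON = re.compile("|".join(KNOWN))


def codon_frequencies(sequence):
    """Tokenize sequence into known codons; return {codon: share of all tokens}."""
    counts = Counter()
    pos = 0
    while pos < len(sequence):
        match = CODON.match(sequence, pos)
        if match is None:
            # Mirror t_error: refuse anything that is not a known codon.
            raise TypeError("Unknown codon at position %d: %r"
                            % (pos, sequence[pos:pos + 3]))
        counts[match.group()] += 1
        pos = match.end()
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


freqs = codon_frequencies("AAABOBAACAAAFOOAACBARAAAAAA")
for codon, freq in sorted(freqs.items()):
    print("Frequency of %s is %f" % (codon, freq))
```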

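Answer 3: (score: 1)

  • a codon table containing all 64 codons, even the non-degenerate ones (they form one-element groups)

  • the occurrences of each codon group are counted in the same iteration that counts the occurrences of the individual codons

  • a codon table that includes the names of the encoded amino acids -> a nicer display

Code: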

from collections import defaultdict

# the first 360 nucleotides from GenBank: AAHX01097212.1
adn = ("tcccccgcagcttcgggaacgtgcgggctcgggagggaggggcctggcgccgggcgcgcg"
       "cctgcgccccaccccgccccaccctggcgggtctcgcgcgcccggcccgcctcctgtcaa"
       "ccccagcgcggcggtcaggtggtccccagcccttggccccagcctccagcttcctggtcc"
       "ctcgggctctgagtcctgtctccggcagatcgcctttctgattgttctcctgcgcagctg"
       "gaggtgtatagcccctagccgagctatggtgcctcagcagatgtgaggaggtagtgggtc"
       "aggataaacccgcgcactccataataacgtgccagggctcagtgacttgggtctgcatta")

arn = adn.upper().replace('T','U')

#RNA codon table from http://en.wikipedia.org/wiki/Genetic_code
codon_table = ((('GCU', 'GCC', 'GCA', 'GCG'),  'Alanine'),
               (('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'),  'Leucine'),
               (('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),  'Arginine'),
               (('AAA', 'AAG'),  'Lysine'),
               (('AAU', 'AAC'),  'Asparagine'),
               (('AUG',),  'Methionine'),
               (('GAU', 'GAC'),  'Aspartic acid' ),              
               (('UUU', 'UUC'),  'Phenylalanine'),
               (('UGU', 'UGC'),  'Cysteine'),
               (('CCU', 'CCC', 'CCA', 'CCG'),  'Proline') ,
               (('CAA', 'CAG'),  'Glutamine'),
               (('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'),  'Serine'),
               (('GAA', 'GAG'),  'Glutamic acid'),
               (('ACU', 'ACC', 'ACA', 'ACG'),  'Threonine'),
               (('GGU', 'GGC', 'GGA', 'GGG'),  'Glycine'),
               (('UGG',),  'Tryptophane'),
               (('CAU', 'CAC'),  'Histidine'),
               (('UAU', 'UAC'),  'Tyrosine'),
               (('AUU', 'AUC', 'AUA'),  'Isoleucine'),
               (('GUU', 'GUC', 'GUA', 'GUG'),  'Valine'),
               (('UAA', 'UGA', 'UAG'),  'STOP')            )

siblings = dict( (cod, codgroup) for codgroup,aa in codon_table for cod in codgroup )

cod_count, grp_count, freq = defaultdict(int), defaultdict(int), {}

for cod in (arn[i:i+3] for i in xrange(0,len(arn),3)):
    cod_count[cod] += 1
    grp_count[siblings[cod]] += 1

for cod in siblings.iterkeys(): # the keys of siblings are the 64 codons
    if siblings[cod] in grp_count:
        freq[cod] = float(cod_count[cod])/grp_count[siblings[cod]]
    else:
        freq[cod] = '-* Missing *-'


display = '\n'.join(aa.rjust(13)+\
                '\n'.join('%s  %-16s' % (cod.rjust(18 if i else 5),freq[cod])
                          for i,cod in enumerate(codgrp))
                for codgrp,aa in codon_table)


# editing addition:

def outputResults(filename,arn,codon_table,displ):

    li = ['This file is named %s' % filename]

    li.append('The sequence of ARN:\n%s' %\
              '\n'.join(arn[i:i+42] for i in xrange(0,len(arn),42)))
    li.append('Size of the sequence : '+str(len(arn)))

    li.append('Codon_table:\n'+\
              '\n'.join('%s : %s' % (u,v) for u,v in codon_table))

    li.append('Frequency results :\n'+displ)

    with open(filename,'w') as f:
        f.write('\n\n'.join(li))


outputResults('ARN_mem.txt',arn,codon_table,display)
print display 

EDIT

I added a function outputResults() to show one way of recording the data and the results

The content of the resulting file is:

This file is named ARN_mem.txt

The sequence of ARN:
UCCCCCGCAGCUUCGGGAACGUGCGGGCUCGGGAGGGAGGGG
CCUGGCGCCGGGCGCGCGCCUGCGCCCCACCCCGCCCCACCC
UGGCGGGUCUCGCGCGCCCGGCCCGCCUCCUGUCAACCCCAG
CGCGGCGGUCAGGUGGUCCCCAGCCCUUGGCCCCAGCCUCCA
GCUUCCUGGUCCCUCGGGCUCUGAGUCCUGUCUCCGGCAGAU
CGCCUUUCUGAUUGUUCUCCUGCGCAGCUGGAGGUGUAUAGC
CCCUAGCCGAGCUAUGGUGCCUCAGCAGAUGUGAGGAGGUAG
UGGGUCAGGAUAAACCCGCGCACUCCAUAAUAACGUGCCAGG
GCUCAGUGACUUGGGUCUGCAUUA

Size of the sequence : 360

Codon_table:
('GCU', 'GCC', 'GCA', 'GCG') : Alanine
('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG') : Leucine
('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG') : Arginine
('AAA', 'AAG') : Lysine
('AAU', 'AAC') : Asparagine
('AUG',) : Methionine
('GAU', 'GAC') : Aspartic acid
('UUU', 'UUC') : Phenylalanine
('UGU', 'UGC') : Cysteine
('CCU', 'CCC', 'CCA', 'CCG') : Proline
('CAA', 'CAG') : Glutamine
('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC') : Serine
('GAA', 'GAG') : Glutamic acid
('ACU', 'ACC', 'ACA', 'ACG') : Threonine
('GGU', 'GGC', 'GGA', 'GGG') : Glycine
('UGG',) : Tryptophane
('CAU', 'CAC') : Histidine
('UAU', 'UAC') : Tyrosine
('AUU', 'AUC', 'AUA') : Isoleucine
('GUU', 'GUC', 'GUA', 'GUG') : Valine
('UAA', 'UGA', 'UAG') : STOP

Frequency results :
      Alanine  GCU  0.1875          
               GCC  0.375           
               GCA  0.25            
               GCG  0.1875          
      Leucine  UUA  0.125           
               UUG  0.0             
               CUU  0.25            
               CUC  0.375   
etc.............
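
The single-pass counting idea from the list at the top of this answer can be isolated in a short Python 3 sketch. The three-entry table and the `skipped` list are assumptions added for illustration: in the full code, a triplet that is missing from `siblings` (for example one containing N) would raise a KeyError, and setting such triplets aside is just one possible policy.

```python
from collections import defaultdict

# Hypothetical, truncated table (the full answer lists all 64 codons).
codon_table = ((("AAA", "AAG"), "Lysine"),
               (("AUG",), "Methionine"),
               (("UGG",), "Tryptophan"))
siblings = {cod: group for group, _ in codon_table for cod in group}

arn = "AUGAAGAAANNN"
cod_count, grp_count, skipped = defaultdict(int), defaultdict(int), []

# One pass: bump the codon's own count and its group's count together.
for cod in (arn[i:i + 3] for i in range(0, len(arn), 3)):
    if cod not in siblings:
        skipped.append(cod)  # assumption: unknown triplets are set aside, not fatal
        continue
    cod_count[cod] += 1
    grp_count[siblings[cod]] += 1

# Groups that never occurred get None instead of a division by zero.
freq = {cod: cod_count[cod] / grp_count[group] if group in grp_count else None
        for cod, group in siblings.items()}
print(freq, skipped)
```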

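Answer 4: (score: 0)

I'm not sure I fully understand the question, but I think you need to split the calculation into two stages: first count how many times each codon occurs, then work out the frequencies. I came up with the following code: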

from collections import defaultdict

# Initial sequence.
sequence = "AAABOBAACAAAFOOAACBARAAAAAA"

# Which codons are grouped together.
groups = (
    ('AAA', 'AAC'),
    ('BOB',),
    ('FOO', 'BAR', 'BAA'),
)

# Separate into list of codons.
codonList = []
for codons in range(0, len(sequence), 3):
    codonList.append(sequence[codons:codons+3])

# Count how many times each codon is used.
counts = defaultdict(int)
for codon in codonList:
    counts[codon] += 1

# Go through and calculate frequencies of each codon.
freqs = {}
for group in groups:
    total = float(sum(counts[codon] for codon in group))
    for codon in group:
        freqs[codon] = counts[codon] / total

# Done.
print freqs

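Note the explicit conversion of total to a float in the last loop. If it were left as an integer, the subsequent division on Python 2.x would be integer division and yield only 0 or 1, so we need the conversion to get floating-point output. The output I get is: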

blair@blair-eeepc:~$ python codons.py 
{'BAR': 0.5, 'AAC': 0.33333333333333331, 'BAA': 0.0, 'AAA': 0.66666666666666663, 'BOB': 1.0, 'FOO': 0.5}

Is this the kind of output you wanted?
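
On the float-conversion point: under Python 3, or Python 2 with `from __future__ import division`, `/` is already true division, so the cast is not needed. The sketch below (Python 3) also guards against a group that never occurs in the sequence, which would otherwise divide by zero. Reporting `None` in that case is an assumption about what the caller wants, as are the helper name `group_frequencies` and the extra `QUX` group.

```python
from collections import Counter


def group_frequencies(sequence, groups):
    """Return {codon: count / total count of its group} for every codon in groups."""
    counts = Counter(sequence[i:i + 3] for i in range(0, len(sequence), 3))
    freqs = {}
    for group in groups:
        total = sum(counts[codon] for codon in group)
        for codon in group:
            # Assumption: a group absent from the sequence yields None, not an error.
            freqs[codon] = counts[codon] / total if total else None
    return freqs


groups = (("AAA", "AAC"), ("BOB",), ("FOO", "BAR", "BAA"), ("QUX",))
freqs = group_frequencies("AAABOBAACAAAFOOAACBARAAAAAA", groups)
print(freqs)
```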