Rosalind problem: Consensus and Profile

Time: 2015-04-08 05:11:23

Tags: python bioinformatics biopython

I am working through the Rosalind problems, specifically the one titled "Consensus and Profile".

The input data looks like this:

 >Rosalind_1
 ATCCAGCT
 >Rosalind_2
 GGGCAACT
 >Rosalind_3
 ATGGATCT
 >Rosalind_4
 AAGCAACC
 >Rosalind_5
 TTGGAACT
 >Rosalind_6
 ATGCCATT
 >Rosalind_7
 ATGGCACT

Above are seven DNA sequences, each with an ID (header line). The output should look like this:

ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6 

Here is my code so far. I want to generate the matrix above by counting the As, Cs, Gs, and Ts column by column:

import sys
from collections import OrderedDict

import Bio.SeqIO

# Maps (nucleotide, column index) -> number of occurrences.
count = OrderedDict()
for seq in Bio.SeqIO.parse(sys.stdin, 'fasta'):
    sequn = str(seq.seq)
    print("sequn", sequn)
    for i, nuc in enumerate(sequn):
        print("nuc", nuc)
        key = (nuc, i)
        try:
            count[key] = count[key] + 1
        except KeyError:
            count[key] = 1

The contents of the count dictionary look like this:

([(('A', 0), 5), (('T', 1), 5), (('C', 2), 1), (('C', 3), 4), (('A', 4), 5),    
(('G', 5), 1), (('C', 6), 6), (('T', 7), 6), (('G', 0), 1), (('G', 1), 1),   
(('G', 2), 6), (('A', 5), 5), (('G', 3), 3), (('T', 5), 1), (('A', 1), 1), 
(('C', 7), 1), (('T', 0), 1), (('C', 4), 2), (('T', 6), 1)])

How can I generate the output matrix from the dictionary above?

Many thanks in advance.

2 answers:

Answer 0: (score: 0)

d = {}

count = ([(('A', 0), 5), (('T', 1), 5), (('C', 2), 1), (('C', 3), 4), (('A', 4), 5),(('G', 5), 1), (('C', 6), 6), (('T', 7), 6), (('G', 0), 1), (('G', 1), 1),   
(('G', 2), 6), (('A', 5), 5), (('G', 3), 3), (('T', 5), 1), (('A', 1), 1), 
(('C', 7), 1), (('T', 0), 1), (('C', 4), 2), (('T', 6), 1)])

for (nuc, spot), n in count:
    # Start each nucleotide with a row of zeros, one slot per column,
    # then record the count for this column.
    li = d.setdefault(nuc, [0] * 8)
    li[spot] = n

for each in sorted(d):
    print(each, " ", d[each])

sol = ""
for each in range(8):
    # The consensus base is the one with the highest count in this column.
    sol += max(d, key=lambda x: d[x][each])
print(sol)

I simply iterated over the whole count dictionary and built a new dictionary in the shape your question asks for.

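You could also do this while populating the count dictionary in the first place. I assumed the sequences have length 8; if they are longer, the code above has to be adjusted accordingly.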
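A minimal sketch of that idea, assuming the sequences have already been read into a list of equal-length strings. The names `build_profile` and `format_profile` are hypothetical and do not appear in the question. Each row is sized from the first sequence instead of a hard-coded 8, and the rows are filled while iterating over the sequences rather than in a second pass:

```python
def build_profile(sequences, alphabet="ACGT"):
    """Count each base per column; all sequences are assumed equal length."""
    length = len(sequences[0])
    profile = {base: [0] * length for base in alphabet}
    for seq in sequences:
        for pos, nuc in enumerate(seq):
            profile[nuc][pos] += 1
    return profile


def format_profile(profile, alphabet="ACGT"):
    """Render rows as 'A: 5 1 0 ...', the layout shown in the question."""
    return "\n".join(
        "{}: {}".format(base, " ".join(str(n) for n in profile[base]))
        for base in alphabet
    )


# The seven sample sequences from the question.
sequences = ["ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
             "TTGGAACT", "ATGCCATT", "ATGGCACT"]
profile = build_profile(sequences)
print(format_profile(profile))
```

Because every base starts with a full row of zeros, columns where a base never occurs need no special handling. A character outside `alphabet` (for example `N`) would raise a `KeyError` in this sketch.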

It would be great if you could edit the question directly.

Answer 1: (score: 0)

Here is a solution using BioPython and collections.Counter:

from Bio import SeqIO
from collections import Counter

def main(fasta_file):
    """
    >>> print(main(r'./data/CONS_sample.fa'))
    ATGCAACT
    A: 5 1 0 0 5 5 0 0
    C: 0 0 1 4 2 0 6 1
    G: 1 1 6 3 0 1 0 0
    T: 1 5 0 0 0 1 1 6
    """
    with open(fasta_file) as fh:
        dna_strings = [str(fasta.seq) for fasta in SeqIO.parse(fh, 'fasta')]
        transposed = zip(*dna_strings)
        counters = [Counter(column) for column in transposed]

        # create consensus
        consensus = ''.join([counter.most_common(1)[0][0] for counter in counters])

        # create profile matrix
        matrix = ''
        for base in 'ACGT':
            matrix += '{}:'.format(base)
            for counter in counters:
                matrix += ' {}'.format(counter[base])
            matrix += '\n'
        matrix = matrix.rstrip()

        return '\n'.join([consensus, matrix])

if __name__ == '__main__':
    import doctest
    doctest.testmod()

    print(main(r'./data/CONS.txt'))
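
The transpose-and-count step in this answer can be tried without a FASTA file. The sketch below is an illustration only: it hard-codes the seven sample sequences from the question in place of the `SeqIO.parse` call, and otherwise uses the same `zip(*...)` and `Counter` calls:

```python
from collections import Counter

# Sample sequences from the question, standing in for the parsed FASTA records.
dna_strings = ["ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
               "TTGGAACT", "ATGCCATT", "ATGGCACT"]

# zip(*strings) yields one tuple per column, e.g. the first column is
# ('A', 'G', 'A', 'A', 'T', 'A', 'A').
columns = list(zip(*dna_strings))
counters = [Counter(column) for column in columns]

# The most common base in each column forms the consensus string.
consensus = "".join(c.most_common(1)[0][0] for c in counters)
print(consensus)  # ATGCAACT
```

A `Counter` returns 0 for a base that never appears in a column, which is why the profile loop can index `counter[base]` for every base in 'ACGT' without a check. Note that `zip` stops at the shortest sequence, so input of unequal lengths would be truncated silently rather than rejected.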