我正在研究Rosalind问题,特别是题为“共识与概况”的问题
数据输入如下:
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
以上是七个带有ID或标题的DNA序列,输出应该是这样的:
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6
现在,到目前为止,这是我的代码,我想生成上面的矩阵,按列计算所有A的C,G和Ts:
import sys
import Bio.SeqIO
count = {}
count=OrderedDict()
list_seq = []
for seq in Bio.SeqIO.parse(sys.stdin, 'fasta'):
sequn = str(seq.seq)
print "sequn",sequn
for i,nuc in enumerate(sequn):
print "nuc", nuc
key = (nuc,i)
try:
count[key] = count[key]+1
except KeyError:
count[key] = 1
字典计数的输出如下所示:
([(('A', 0), 5), (('T', 1), 5), (('C', 2), 1), (('C', 3), 4), (('A', 4), 5),
(('G', 5), 1), (('C', 6), 6), (('T', 7), 6), (('G', 0), 1), (('G', 1), 1),
(('G', 2), 6), (('A', 5), 5), (('G', 3), 3), (('T', 5), 1), (('A', 1), 1),
(('C', 7), 1), (('T', 0), 1), (('C', 4), 2), (('T', 6), 1)])
我想从上面的输出字典生成输出矩阵,怎么做呢?
提前多多感谢。
答案 0 :(得分:0)
d = {}
count = ([(('A', 0), 5), (('T', 1), 5), (('C', 2), 1), (('C', 3), 4), (('A', 4), 5),(('G', 5), 1), (('C', 6), 6), (('T', 7), 6), (('G', 0), 1), (('G', 1), 1),
(('G', 2), 6), (('A', 5), 5), (('G', 3), 3), (('T', 5), 1), (('A', 1), 1),
(('C', 7), 1), (('T', 0), 1), (('C', 4), 2), (('T', 6), 1)])
for each in count:
if each[0][0] in d:
li = d[each[0][0]]
spot = each[0][1]
li[spot] = each[1]
d[each[0][0]] = li
else:
li=[0]*8
spot = each[0][1]
li[spot] = each[1]
d[each[0][0]] = li
for each in sorted(d):
print each," ",d[each]
sol=""
for each in range(8):
sol+=max(d, key=lambda x:d[x][each])
print sol
我刚刚对整个字典进行了迭代,并根据你的问题创建了一个新的字典。
但你可以在修改字典计数时这样做。我假设列表的长度为8.如果它高于8.应该相应地修改上面的内容。
如果您能够直接编辑问题,那就太好了。
答案 1 :(得分:0)
以下是使用BioPython
和collections.Counter
from Bio import SeqIO
from collections import Counter
def main(fasta_file):
"""
>>> print main(r'./data/CONS_sample.fa')
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6
"""
with open(fasta_file) as fh:
dna_strings = [str(fasta.seq) for fasta in SeqIO.parse(fh, 'fasta')]
transposed = zip(*dna_strings)
counters = [Counter(column) for column in transposed]
# create consensus
consensus = ''.join([counter.most_common(1)[0][0] for counter in counters])
# create profile matrix
matrix = ''
for base in 'ACGT':
matrix += '{}:'.format(base)
for counter in counters:
matrix += ' {}'.format(counter[base])
matrix += '\n'
matrix = matrix.rstrip()
return '\n'.join([consensus, matrix])
if __name__ == '__main__':
import doctest
doctest.testmod()
print main(r'./data/CONS.txt')