From Python

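Time: 2017-12-24 13:58:08

Tags: python dictionary fasta

I solved this problem with the BioPython library. However, I want to learn to program, so I would rather not use BioPython.

I have a FASTA file containing the following DNA sequences: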

>chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79
TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGC
AGGACAGGCCGCTAAAGTGTGGTTTCGTGGTT
>chr12_9180206_+:chr12_118582391_+:a2;2 total_counts: 135 Seed: 4 K: 20 length: 80
CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCG
GCCTGGTAACACGTGCCAGCGACAGCTGCTCGTA
>chr1_8969882_-:chr1_568670_-:a1;113 total_counts: 7600 Seed: 225 K: 20 length: 86
CACTCATGAGCTGTCCCCACATTAGGCTTAAAAACAGATGCAATTCCCGGACGTCTAAAC
CAAACCACTTTCACCGCCACACGACCCTTCAACTCCTACATACTTCCCCCA
TTATTCCTAGAACCAGGCGACCTGCGACTCCTTGACGTTGACAATCGA
>chr1_8969882_-:chr1_568670_-:a2;69 total_counts: 6987 Seed: 197 K: 20 length: 120
TGAACCTACGACTACACCGACTACGGCGGACTAATCTTCAACTCCTACATACTTCCCCCA
TTATTCCTAGAACCAGGCGACCTGCGACTCCTTGACGTTGACAATCGAGTAGTACTCCCG

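I then want to create a dictionary in which each header line (the line starting with >) is the key and its sequence is the value.

Also, for each of the 4 sequences, how can I get the count of each DNA base?

Thanks

2 answers:

Answer 0: (score: 1)

To run this code from a terminal: python readfasta.py fastafile.fasta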

import sys

########## I. To Load Fasta File ##############
fasta = open(sys.argv[1])
seqs = {}
########## II. To Make fasta dictionary ####
tnv = None  # temporary name value: header of the record being read
for rfile in fasta:
    rfile = rfile.strip()
    if rfile.startswith(">"):  # a header line starts a new record
        tnv = rfile
        seqs[tnv] = ""
    elif rfile and tnv is not None:  # skip blank lines and text before the first header
        seqs[tnv] += rfile
fasta.close()
############## III. To Make Counts ########
count_what = ["A", "T", "C", "G", "ATG"]
for name, seq in seqs.items():
    print(name)  # to print seq name if you have a multifasta file
    for cw in count_what:
        print(cw, seq.count(cw))  # to print counts by seq

Answer 1: (score: 0)

This is not a difficult task in Python, but what have you tried so far?

import pprint

with open('/path/to/subject.fasta') as f:
    ret = {}

    all_bases = ''
    bases = ''
    description_line = ''
    for l in f:
        l = l.strip()
        if l.startswith('>'):
            if bases:
                ret[description_line] = bases
                bases = ''
            description_line = l
        else:
            bases += l
            all_bases += l
    if bases:
        ret[description_line] = bases

pprint.pprint(ret)

You get:

{'>chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79':
 'TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGCAGGACAGGCCGCTAAAGTGTGGTTTCGTGGTT',
 ...}

To count all the bases:

from collections import Counter
print(Counter(all_bases))

which yields:

Counter({'C': 150, 'T': 114, 'A': 110, 'G': 91})
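
Note that the Counter above covers all four sequences combined. Since the question asks for the counts of each sequence separately, here is a minimal sketch of one way to do that. The names `read_fasta`, `records` and `per_sequence`, and the two-record example input, are made up for illustration and are not part of the code above.

```python
from collections import Counter


def read_fasta(lines):
    """Map each '>' header line to its concatenated sequence."""
    records = {}
    header = None  # header of the record currently being read
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            header = line
            records[header] = ""
        elif line and header is not None:
            # Sequence lines of one record may span several lines.
            records[header] += line
    return records


# Hypothetical two-record input, invented for this example only.
example = """>seq1
ACGT
AC
>seq2
GGTA
""".splitlines()

records = read_fasta(example)

# One Counter per record instead of a single Counter for the whole file.
per_sequence = {header: Counter(seq) for header, seq in records.items()}

for header, counts in per_sequence.items():
    print(header, dict(counts))
```

To use it on a real file, pass the open file object instead of `example`, for instance `with open(path) as f: records = read_fasta(f)`. A Counter returns 0 for a base that never occurs, so `per_sequence[header]["N"]` is safe even when a sequence contains no N.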