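To solve this problem I used the BioPython library. However, I want to learn to program, so I would rather not use BioPython.
I have a FASTA file containing the following DNA sequences: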
>chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79
TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGC
AGGACAGGCCGCTAAAGTGTGGTTTCGTGGTT
>chr12_9180206_+:chr12_118582391_+:a2;2 total_counts: 135 Seed: 4 K: 20 length: 80
CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCG
GCCTGGTAACACGTGCCAGCGACAGCTGCTCGTA
>chr1_8969882_-:chr1_568670_-:a1;113 total_counts: 7600 Seed: 225 K: 20 length: 86
CACTCATGAGCTGTCCCCACATTAGGCTTAAAAACAGATGCAATTCCCGGACGTCTAAAC
CAAACCACTTTCACCGCCACACGACCCTTCAACTCCTACATACTTCCCCCA
TTATTCCTAGAACCAGGCGACCTGCGACTCCTTGACGTTGACAATCGA
>chr1_8969882_-:chr1_568670_-:a2;69 total_counts: 6987 Seed: 197 K: 20 length: 120
TGAACCTACGACTACACCGACTACGGCGGACTAATCTTCAACTCCTACATACTTCCCCCA
TTATTCCTAGAACCAGGCGACCTGCGACTCCTTGACGTTGACAATCGAGTAGTACTCCCG
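I want to create a dictionary in which each header line (the line starting with >) is a key and the corresponding sequence is its value.
Also, for each of the 4 sequences, how can I get the count of each DNA base?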
Answer 0: (score: 1)
To run this code from a terminal: python readfasta.py fastafile.fasta
import sys

########## I. Load the FASTA file ##########
file = open(sys.argv[1])
rfile = file.readline()
seqs = {}
########## II. Build the FASTA dictionary ##########
tnv = ""  # temporary name value (the current header line)
while rfile != "":
    if rfile.startswith(">"):
        tnv = rfile.strip()
        seqs[tnv] = ""
    else:
        seqs[tnv] += rfile.strip()  # join wrapped sequence lines
    rfile = file.readline()
file.close()
########## III. Count bases ##########
count_what = ["A", "T", "C", "G", "ATG"]
for s in seqs:
    seq = seqs[s]
    print(s)  # print the sequence name (useful for a multi-FASTA file)
    for cw in count_what:
        print(cw, seq.count(cw))  # print the count for this sequence
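The same loop can be wrapped in a function so that the dictionary is reusable elsewhere. This is only a minimal sketch, assuming Python 3; `read_fasta` and the two-record sample are hypothetical and used purely for illustration:

```python
import io


def read_fasta(handle):
    """Return {header line: sequence} from an open FASTA file handle.

    Hypothetical helper for illustration; headers keep their leading '>'.
    """
    seqs = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line  # join wrapped sequence lines
    return seqs


# Made-up two-record sample standing in for a real file
sample = io.StringIO(">seq1\nACGT\nAC\n>seq2\nGGTA\n")
records = read_fasta(sample)
for name, seq in records.items():
    print(name, {base: seq.count(base) for base in "ACGT"})
```

With a real file, the handle would come from something like `open(sys.argv[1])` inside a `with` block, so the file is closed automatically.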
Answer 1: (score: 0)
This is not a hard task in Python, but what have you tried so far?
import pprint

with open('/path/to/subject.fasta') as f:
    ret = {}
    all_bases = ''
    bases = ''
    description_line = ''
    for l in f:
        l = l.strip()
        if l.startswith('>'):
            # A new header: store the sequence collected so far
            if bases:
                ret[description_line] = bases
                bases = ''
            description_line = l
        else:
            bases += l
            all_bases += l
    # Store the last sequence in the file
    if bases:
        ret[description_line] = bases
pprint.pprint(ret)
You get:
{'>chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79':
'TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGCAGGACAGGCCGCTAAAGTGTGGTTTCGTGGTT',
...}
To count all the bases:
from collections import Counter
print(Counter(all_bases))
which yields:
Counter({'C': 150, 'T': 114, 'A': 110, 'G': 91})
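Note that `all_bases` pools every sequence in the file. To get counts for each sequence separately, as the question asks, one option is to apply `Counter` to each dictionary value. A minimal sketch, assuming a dictionary shaped like `ret` above; the short entries here are made-up placeholders, not the real data:

```python
from collections import Counter

# Placeholder dictionary shaped like `ret`; the sequences are made up
ret = {
    ">example_a": "TTGGTTTCGT",
    ">example_b": "CTAACCCCCT",
}

# One Counter per header, so each sequence is tallied on its own
per_sequence = {header: Counter(seq) for header, seq in ret.items()}

for header, counts in per_sequence.items():
    # A Counter returns 0 for a base that never occurs
    print(header, [(base, counts[base]) for base in "ACGT"])
```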