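I wrote some code that parses an EMBL file and dumps specific regions of the file into a dictionary.
The dictionary's keys are the labels of the regions I want to capture, and each key's value is the region itself.
I then wrote another function that writes the contents of the dictionary to a text file.
However, the text file contains the information in a different order from the original EMBL file.
I can't figure out why this happens. Is it because dictionaries are unordered? Is there a way around it?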
from Bio import SeqIO

s6633 = SeqIO.read("6633_seq.embl", "embl")

def make_dict_realgenes(x):
    dict = {}
    for i in range(len(x.features)):
        if x.features[i].type == 'CDS':
            if 'hypothetical' not in x.features[i].qualifiers['product'][0]:
                try:
                    if x.features[i].location.strand == -1:
                        x1 = x.features[i].location.end
                        y1 = x1 + 30
                        dict[str(x.features[i].qualifiers['product'][0])] =\
                            str(x[x1:y1].seq.reverse_complement())
                    else:
                        x2 = x.features[i].location.start
                        y2 = x2 - 30
                        dict[x.features[i].qualifiers['product'][0]] =\
                            str(x[y2:x2].seq)
                except KeyError:
                    if x.features[i].location.strand == -1:
                        x1 = x.features[i].location.end
                        y1 = x1 + 30
                        dict[str(x.features[i].qualifiers['translation'][0])] =\
                            str(x[x1:y1].seq.reverse_complement())
                    else:
                        x2 = x.features[i].location.start
                        y2 = x2 - 30
                        dict[x.features[i].qualifiers['translation'][0]] =\
                            str(x[y2:x2].seq)
    return dict

def rbs_file(dict):
    list = []
    c = 0
    for k, v in dict.iteritems():
        list.append(">" + k + " " + str(c) + "\n" + v + "\n")
        c = c + 1
    f = open("out.txt", "w")
    a = 0
    for i in list:
        f.write(i)
        a = a + 1
    f.close()
Answer 0 (score: 2)
To preserve insertion order in a dictionary, use OrderedDict from the collections module. Try changing the top of your code to:
from collections import OrderedDict
from Bio import SeqIO

s6633 = SeqIO.read("6633_seq.embl", "embl")

def make_dict_realgenes(x):
    dict = OrderedDict()
    ...
Also, I'd suggest not shadowing the built-in name "dict". Rename the variable if you can do so easily.
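To make the ordering behaviour concrete, here is a minimal sketch using made-up labels and sequences rather than anything read from the EMBL file. The names regions, ordered and labels_in_order are only for illustration. Note that on Python 3.7 and later a plain dict also preserves insertion order, so OrderedDict mainly matters on older interpreters such as the Python 2 implied by iteritems() in the question.

```python
from collections import OrderedDict

# Hypothetical (label, upstream sequence) pairs, listed in "file order".
regions = [
    ("dnaA", "ATGACC"),
    ("recF", "GGTTAA"),
    ("gyrB", "CCATGG"),
]

ordered = OrderedDict()
for label, sequence in regions:
    ordered[label] = sequence  # insertion order is remembered

# Iterating yields the keys in the order they were inserted.
labels_in_order = list(ordered.keys())
print(labels_in_order)
```

Writing the file from such a mapping then reproduces the order in which the features were encountered.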
Answer 1 (score: 0)
I refactored your code a bit. I'd suggest writing the output while parsing the file, rather than passing the data around in OrderedDicts.
from Bio import SeqIO

output = open("out.txt", "w")

for seq in SeqIO.parse("CP001187.embl", "embl"):
    for feature in seq.features:
        if feature.type == "CDS":
            qualifier = (feature.qualifiers.get("product") or
                         feature.qualifiers.get("translation"))[0]
            if "hypothetical" not in qualifier:
                if feature.location.strand == -1:
                    x1 = feature.location.end
                    x2 = x1 + 30
                    sequence = seq[x1:x2].seq.reverse_complement()
                else:
                    x1 = feature.location.start
                    x2 = x1 - 30
                    sequence = seq[x2:x1].seq
                output.write(">" + qualifier + "\n")
                output.write(str(sequence) + "\n")
                # You can always insert here to the OrderedDict anyway, e.g.
                # d[qualifier] = str(sequence)

output.close()
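One thing to watch in the slicing above: if a forward-strand feature starts less than 30 bases into the record, start - 30 becomes negative, and Python slicing then counts from the end of the sequence, which typically gives an empty slice. The sketch below shows the same upstream-region logic on plain strings with the start clamped at 0. The helper names reverse_complement and upstream_region are hypothetical and not part of Biopython, and the coordinates are assumed to be 0-based and end-exclusive.

```python
def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
    return "".join(complement[base] for base in reversed(dna))


def upstream_region(genome, start, end, strand, length=30):
    """Return up to `length` bases upstream of a feature.

    `start` and `end` are assumed to be 0-based, end-exclusive coordinates.
    """
    if strand == -1:
        # Upstream of a reverse-strand gene lies after `end` on the forward strand.
        return reverse_complement(genome[end:end + length])
    # Clamp at 0: a negative slice start would count from the end of the string.
    return genome[max(0, start - length):start]


genome = "AAACCCGGGTTT"
forward = upstream_region(genome, 3, 6, 1, length=5)
reverse = upstream_region(genome, 3, 6, -1, length=5)
print(forward, reverse)
```

Whether a truncated region is acceptable, or whether the sequence should wrap around for a circular genome, depends on what you need downstream.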
In Python, for i in range(len(anything)) is rarely the way to go.
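As a small illustration of that point, the sketch below compares the index-based loop with direct iteration. The list feature_types is a made-up stand-in for x.features; any sequence behaves the same way.

```python
# Hypothetical stand-in for the feature types of a parsed record.
feature_types = ["source", "gene", "CDS", "CDS", "tRNA"]

# Index-based loop, in the style used in the question.
cds_by_index = []
for i in range(len(feature_types)):
    if feature_types[i] == "CDS":
        cds_by_index.append(i)

# Direct iteration; enumerate() supplies the index only when it is needed.
cds_direct = [i for i, kind in enumerate(feature_types) if kind == "CDS"]
print(cds_direct)
```

Both loops find the same positions, but the second one avoids repeating the indexing expression on every line.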
There is also a cleaner way to write out sequences with Biopython. Collect the sequences as SeqRecords in a list, instead of using a dict or OrderedDict:
from Bio.SeqRecord import SeqRecord
my_seqs = []
# Each time you generate a sequence, instead of writing to a file
# or inserting in dict, do this:
my_seqs.append(SeqRecord(sequence, id=qualifier, description=""))
# Now that you have my_seqs, they can be written out in a single line:
SeqIO.write(my_seqs, "output.fas", "fasta")