Parsing from a file into a dictionary in the correct order in Python

Time: 2014-08-12 18:22:49

Tags: parsing python-2.7 dictionary biopython

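I've written some code that parses an EMBL file and dumps specific regions of the file into a dictionary.

The dictionary keys correspond to the labels of the regions I want to capture, and each value is the region itself.

I then wrote another function that writes the contents of the dictionary to a text file.

However, the text file lists the information in a different order from the original EMBL file.

I can't figure out why this happens. Is it because dictionaries are unordered? Is there a way around it?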

from Bio import SeqIO

s6633 = SeqIO.read("6633_seq.embl", "embl")

def make_dict_realgenes(x):
    dict = {}
    for i in range(len(x.features)):
        if x.features[i].type == 'CDS':
            if 'hypothetical' not in x.features[i].qualifiers['product'][0]:
                try:
                    if x.features[i].location.strand == -1:
                        x1 = x.features[i].location.end
                        y1 = x1 + 30
                        dict[str(x.features[i].qualifiers['product'][0])] =\
                             str(x[x1:y1].seq.reverse_complement())
                    else:
                        x2 = x.features[i].location.start
                        y2 = x2 - 30
                        dict[x.features[i].qualifiers['product'][0]] =\
                             str(x[y2:x2].seq)
                except KeyError:
                    if x.features[i].location.strand == -1:
                        x1 = x.features[i].location.end
                        y1 = x1 + 30
                        dict[str(x.features[i].qualifiers['translation'][0])] =\
                             str(x[x1:y1].seq.reverse_complement())
                    else:
                        x2 = x.features[i].location.start
                        y2 = x2 - 30
                        dict[x.features[i].qualifiers['translation'][0]] =\
                             str(x[y2:x2].seq)
    return dict

def rbs_file(dict):
    list = []
    c = 0
    for k, v in dict.iteritems():
        list.append(">" + k + " " + str(c) + "\n" + v + "\n")
        c = c + 1

    f = open("out.txt", "w")
    a = 0
    for i in list:
        f.write(i)
        a = a + 1

    f.close()

2 Answers:

Answer 0 (score: 2)

To preserve insertion order in a dictionary, use OrderedDict from the collections module. Try changing the top of your code to:

from collections import OrderedDict
from Bio import SeqIO

s6633 = SeqIO.read("6633_seq.embl", "embl")

def make_dict_realgenes(x):
    dict = OrderedDict()   
...

Also, I'd recommend not shadowing the built-in name dict. Rename the variable if you can do so easily.
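As a minimal sketch of the behaviour this answer relies on, the snippet below fills an OrderedDict and iterates over it. The product names and sequences are made up for illustration and are not taken from the asker's EMBL file.

```python
from collections import OrderedDict

# Hypothetical product names and upstream sequences, for illustration only.
regions = [
    ("zeta protein", "ATGC"),
    ("alpha kinase", "GGCC"),
    ("mu transporter", "TTAA"),
]

ordered = OrderedDict()
for product, upstream in regions:
    ordered[product] = upstream

# Iteration follows insertion order, not alphabetical or hash order.
for index, (product, upstream) in enumerate(ordered.items()):
    print(">%s %d\n%s" % (product, index, upstream))
```

On Python 2.7, which the question is tagged with, a plain dict gives no ordering guarantee, so OrderedDict is needed. Since Python 3.7 a plain dict also preserves insertion order.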

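Answer 1 (score: 0)

I refactored your code a little. I suggest writing the output while parsing the file, rather than carrying the data around in an OrderedDict.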

from Bio import SeqIO


output = open("out.txt", "w")

for seq in SeqIO.parse("CP001187.embl", "embl"):
    for feature in seq.features:
        if feature.type == "CDS":
            qualifier = (feature.qualifiers.get("product") or
                         feature.qualifiers.get("translation"))[0]
            if "hypothetical" not in qualifier:
                if feature.location.strand == -1: 
                    x1 = feature.location.end
                    x2 = x1 + 30
                    sequence = seq[x1:x2].seq.reverse_complement()
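                else:
                    x1 = feature.location.start
                    # Clamp at 0 so a CDS near the start of the sequence
                    # does not produce a negative slice index.
                    x2 = max(0, x1 - 30)
                    sequence = seq[x2:x1].seq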

                output.write(">" + qualifier + "\n")
                output.write(str(sequence) + "\n")

                # You can always insert here to the OrderedDict anyway, e.g.
                # d[qualifier] = str(sequence)

output.close()
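The slicing logic in the loop above can be tried out on a plain string, without Biopython. This is only a sketch: `upstream_window` and `reverse_complement` are hypothetical helpers written for this illustration, coordinates are assumed to be 0-based and end-exclusive, and clamping the forward-strand window at 0 is an assumption about the desired behaviour near the start of the sequence.

```python
# Hypothetical helpers for illustration; they are not part of Biopython.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def reverse_complement(dna):
    """Return the reverse complement of a plain DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def upstream_window(genome, start, end, strand, width=30):
    """Return up to `width` bases immediately upstream of a feature."""
    if strand == -1:
        # On the reverse strand, "upstream" lies after the feature's end.
        return reverse_complement(genome[end:end + width])
    # Clamp at 0 so a feature near the start gives a shorter window
    # instead of a negative slice index.
    return genome[max(0, start - width):start]


genome = "AAACCCGGGTTT"
print(upstream_window(genome, 6, 9, 1, width=3))   # forward strand
print(upstream_window(genome, 3, 6, -1, width=3))  # reverse strand
print(upstream_window(genome, 2, 5, 1, width=3))   # window cut off at position 0
```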

In Python, for i in range(len(anything)) is rarely the right approach.
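To illustrate the point, here is a small comparison using a made-up list of feature types (a stand-in for x.features, not real Biopython objects). Iterating directly reads more clearly, and enumerate supplies the index on the occasions when it is needed.

```python
# Hypothetical stand-in for the types of the features in a record.
feature_types = ["source", "CDS", "tRNA", "CDS"]

# Index-based loop, in the style of the question's code.
by_index = []
for i in range(len(feature_types)):
    if feature_types[i] == "CDS":
        by_index.append(i)

# Direct iteration; enumerate provides the position alongside each item.
cds_positions = [i for i, kind in enumerate(feature_types) if kind == "CDS"]

print(by_index == cds_positions)
```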


There is also a cleaner way to output sequences with Biopython. Append the Seqs to a list, rather than using a dict or OrderedDict:

from Bio.SeqRecord import SeqRecord

my_seqs = []

# Each time you generate a sequence, instead of writing to a file
# or inserting in dict, do this:
my_seqs.append(SeqRecord(sequence, id=qualifier, description=""))

# Now that you have my_seqs, they can be written in a single line:
SeqIO.write(my_seqs, "output.fas", "fasta")
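One possible reason a list suits this job better than a dict, sketched here without Biopython: if two features happen to share a product name (an assumption about the input, not something stated in the question), a dict keyed by name keeps only the last one, while a list keeps every entry in order. The names, sequences and the `to_fasta` helper are hypothetical.

```python
# Hypothetical (name, sequence) pairs; two features share a product name.
pairs = [
    ("ABC transporter", "ATGCAT"),
    ("DNA polymerase", "GGCCTA"),
    ("ABC transporter", "TTGACA"),
]

as_dict = dict(pairs)  # the later duplicate overwrites the earlier entry
as_list = list(pairs)  # every entry survives, in the original order


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in records)


print(len(as_dict), len(as_list))
print(to_fasta(as_list))
```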