Question

我正在尝试编写一个Python脚本，它将一种特殊类型的文件作为输入该文件包含有关多个基因的信息，有关一个基因的信息是通过多行写的，其中每个基因的行数不同。一个例子是：

 gene            join(373616..374161,1..174)
                 /locus_tag="AM1_A0001"
                 /db_xref="GeneID:5685236"
 CDS             join(373616..374161,1..174)
                 /locus_tag="AM1_A0001"
                 /codon_start=1
                 /transl_table=11
                 /product="glutathione S-transferase, putative"
                 /protein_id="YP_001520660.1"
                 /db_xref="GI:158339653"
                 /db_xref="GeneID:5685236"
                 /translation="MKIVSFKICPFVQRVTALLEAKGIDYDIEYIDLSHKPQWFLDLS
                 PNAQVPILITDDDDVLFESDAIVEFLDEVVGTPLSSDNAVKKAQDRAWSYLATKHYLV
                 QCSAQRSPDAKTLEERSKKLSKAFGKIKVQLGESRYINGDDLSMVDIAWLPLLHRAAI
                 IEQYSGYDFLEEFPKVKQWQQHLLSTGIAEKSVPEDFEERFTAFYLAESTCLGQLAKS
                 KNGEACCGTAECTVDDLGCCA"
 gene            241..381
                 /locus_tag="AM1_A0002"
                 /db_xref="GeneID:5685411"
 CDS             241..381
                 /locus_tag="AM1_A0002"
                 /codon_start=1
                 /transl_table=11
                 /product="hypothetical protein"
                 /protein_id="YP_001520661.1"
                 /db_xref="GI:158339654"
                 /db_xref="GeneID:5685411"
                 /translation="MLINPEDKQVEIYRPGQDVELLQSPSTISGADVLPEFSLNLEWI
                 WR"
 gene            388..525
                 /locus_tag="AM1_A0003"
                 /db_xref="GeneID:5685412"
 CDS             388..525
                 /locus_tag="AM1_A0003"
                 /codon_start=1
                 /transl_table=11
                 /product="hypothetical protein"
                 /protein_id="YP_001520662.1"
                 /db_xref="GI:158339655"
                 /db_xref="GeneID:5685412"
                 /translation="MKEAGFSENSRSREGQPKLAKDAAIAKPYLVAMTAELQIMATET
                 L"

我现在想要的是创建一个词典列表，其中每个词典都包含有关一个基因的信息，如下所示：

gene_1 = {"locus": /locus_tag, "product": /product, ...}
gene_2 = {"locus": /locus_tag, "product": /product, ...}

我完全不知道当一个基因/词典完成并且下一个应该开始时，我怎么能让Python知道有人可以帮帮我吗？有没有办法做到这一点？

澄清：我知道如何提取我想要的信息，将其保存在变量中并将其输入字典中。我只是不知道如何告诉Python为每个基因创建一个字典。

Answer 1

我为这个纯Python做了一个也许不太好但功能很强的解析器，也许它至少可以用作一个基本的想法：

import re
import pprint
printer = pprint.PrettyPrinter(indent=4)

with open("entities.txt", "r") as file_obj:
    entities = list()

    for line in file_obj.readlines():
        line = line.replace('\n', '')

        if re.match(r'\s*(gene|CDS)\s+[\w(\.,)]+', line):
            parts = line.split()
            entity = {parts[0]: parts[1]}
            entities.append(entity)
        else:
            try:
                (attr_name,) = re.findall(r'/\w+=', line)
                attr_name = attr_name.strip('/=')
            except ValueError:
                addition = line.strip()
                entity[last_key] = ''.join([entity[last_key], addition])
            else:
                try:
                    (attr_value,) = re.findall(r'="\w+$', line)
                    last_key = attr_name
                except ValueError:
                    try:
                        (attr_value,) = re.findall(r'="[\w\s\.:,-]+"', line)
                    except ValueError:
                        (attr_value,) = re.findall(r'=\d+$', line)

                    attr_value = attr_value.strip('"=')

                if attr_name in entity:
                    entity[attr_name] = [entity[attr_name], attr_value]
                else:
                    entity[attr_name] = attr_value

printer.pprint(entities)

Answer 2

如果有人对初学者的解决方案感兴趣，我会在收到的评论的帮助下找到，这里是：

import sys, re

annot = file("example.embl", "r")
embl = ""
annotation = []

for line in annot:
    embl += line

embl_list = embl.split("FT   gen")

for item in embl_list:
    if "e            " in item:
        split_item = item.split("\n")
        for l in split_item:
            if "e            " in l:
                if not "complement" in l:
                    coordinates = l[13:len(l)]
                    C = coordinates.split("..")
                    genestart = C[0]
                    geneend = C[1]
                    strand = "+"
                if "complement" in l:
                    coordinates = l[24:len(l)-1]
                    C = coordinates.split("..")
                    genestart = C[0]
                    geneend = C[1]
                    strand = "-"

            if "/locus_tag" in l:
                L = l.split('"')
                locus = L[1]

            if "/product" in l:
                P = l.split('"')
                product = P[1]

        annotation.append({
            "locus": locus,
            "genestart": genestart,
            "geneend": geneend,
            "product": product,
        })
    else:
        print "Finished!"

从Python中的一个文件到多个词典

2 个答案: