I have already asked a question about this program here.
The code I am currently running is:
import re

out = open("parse_go.txt", "w")
id_to_info = {}  # declare dictionary

def parse_record(term):
    go_id = re.findall(r"id:\s(.*?)\n", term, re.DOTALL)[0]
    name = re.findall(r"name:\s(.*?)\n", term, re.DOTALL)[0]
    namespace = re.findall(r"namespace:\s(.*?)\n", term, re.DOTALL)[0]
    is_a = re.findall(r"is_a:\s(.*?)\n", term, re.DOTALL)
    is_a = "\n\t".join(is_a)
    info = namespace + "\n" + "\t" + name + "\n" + "\t" + is_a
    id_to_info[go_id] = info
    for go_id, info in id_to_info.items():
        out.write(go_id + "\t" + info + "\n\n")
    # for go_id in id_to_info:
    #     out.write(go_id + "\t" + info + "\n\n")

def split_record(record):
    sp_file = open(record)
    sp_records = sp_file.read()
    sp_split_records = re.findall(r"(\[.*?)\n\n", sp_records, re.DOTALL)
    for sp_record in sp_split_records:
        parse_record(term=sp_record)
    sp_file.close()

split_record(record="/scratch/go-basic.obo")
However, from the beginning of the output file I can see that the same results are printed multiple times.
GO:0000001 biological_process
mitochondrion inheritance
GO:0048308 ! organelle inheritance
GO:0000002 biological_process
mitochondrial genome maintenance
GO:0000001 biological_process
mitochondrion inheritance
GO:0048308 ! organelle inheritance
GO:0000002 biological_process
mitochondrial genome maintenance
GO:0000003 biological_process
reproduction
GO:0000001 biological_process
mitochondrion inheritance
GO:0048308 ! organelle inheritance
GO:0000005 molecular_function
ribosomal chaperone activity
GO:0000002 biological_process
mitochondrial genome maintenance
GO:0000003 biological_process
reproduction
GO:0000001 biological_process
mitochondrion inheritance
GO:0048308 ! organelle inheritance
The beginning of the input file is shown below, but it is a very large file and the script takes a long time to run:
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw]
is_a: GO:0007005 ! mitochondrion organization
[Term]
id: GO:0000003
name: reproduction
namespace: biological_process
alt_id: GO:0019952
alt_id: GO:0050876
def: "The production of new individuals that contain some portion of genetic material inherited from one or more parent organisms." [GOC:go_curators, GOC:isa_complete, GOC:jl, ISBN:0198506732]
subset: goslim_generic
subset: goslim_pir
subset: goslim_plant
subset: gosubset_prok
synonym: "reproductive physiological process" EXACT []
xref: Wikipedia:Reproduction
is_a: GO:0008150 ! biological_process
[Term]
id: GO:0000005
name: ribosomal chaperone activity
namespace: molecular_function
def: "OBSOLETE. Assists in the correct assembly of ribosomes or ribosomal subunits in vivo, but is not a component of the assembled ribosome when performing its normal biological function." [GOC:jl, PMID:12150913]
comment: This term was made obsolete because it refers to a class of gene products and a biological process rather than a molecular function.
is_obsolete: true
consider: GO:0042254
consider: GO:0044183
consider: GO:0051082
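I know that a dictionary is not kept in numerical order, but I would like to know whether repeated printouts like this are expected, or whether they are caused by an error in my code?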
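One thing that may be worth checking is where the write loop sits. In the code above, the loop over `id_to_info` runs inside `parse_record`, which is called once per record, so every call writes out everything collected so far. Below is a minimal sketch of the alternative, where parsing only fills the dictionary and writing happens once at the end. The helper names `parse_obo_text` and `write_records`, the `[Term]`-only split pattern, and the `re.MULTILINE` field patterns are my own assumptions for illustration, not part of the original script; the small sample string stands in for the real file.

```python
import io
import re

def parse_record(term):
    """Return (go_id, info) for one [Term] block; nothing is written here."""
    # ^...$ with re.MULTILINE matches whole lines, so a field on the last
    # line of the block is found even without a trailing newline.
    go_id = re.search(r"^id:\s(.*)$", term, re.MULTILINE).group(1)
    name = re.search(r"^name:\s(.*)$", term, re.MULTILINE).group(1)
    namespace = re.search(r"^namespace:\s(.*)$", term, re.MULTILINE).group(1)
    is_a = re.findall(r"^is_a:\s(.*)$", term, re.MULTILINE)
    info = namespace + "\n\t" + name + "\n\t" + "\n\t".join(is_a)
    return go_id, info

def parse_obo_text(text):
    """Collect every [Term] block into a dict keyed by GO id (hypothetical helper)."""
    id_to_info = {}
    # Assumption: only [Term] stanzas are wanted; each ends at a blank line or end of text.
    for record in re.findall(r"(\[Term\].*?)(?:\n\n|\Z)", text, re.DOTALL):
        go_id, info = parse_record(record)
        id_to_info[go_id] = info
    return id_to_info

def write_records(id_to_info, out):
    """Write each entry exactly once, after all parsing has finished."""
    for go_id, info in id_to_info.items():
        out.write(go_id + "\t" + info + "\n\n")

# Small stand-in for the start of go-basic.obo
sample = (
    "[Term]\n"
    "id: GO:0000001\n"
    "name: mitochondrion inheritance\n"
    "namespace: biological_process\n"
    "is_a: GO:0048308 ! organelle inheritance\n"
    "is_a: GO:0048311 ! mitochondrion distribution\n"
    "\n"
    "[Term]\n"
    "id: GO:0000002\n"
    "name: mitochondrial genome maintenance\n"
    "namespace: biological_process\n"
    "is_a: GO:0007005 ! mitochondrion organization\n"
)

out = io.StringIO()  # stands in for open("parse_go.txt", "w")
write_records(parse_obo_text(sample), out)
print(out.getvalue())
```

With the real file, the same functions could be called with the file's contents and an output file opened in a `with` block, so the dictionary is written a single time instead of once per record.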