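Incorrect parsing with BCBio's GFF parser

Date: 2013-11-25 10:47:21

Tags: python parsing biopython gbk

I am trying out BCBio's GFF parser in the hope of using it in my tool. I took a test .gbk file from NCBI's RefSeq database and used it to produce a .gff file.

The code I used (from http://biopython.org/wiki/GFF_Parsing):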

#!/usr/bin/python
from BCBio import GFF
from Bio import SeqIO

def convert_to_GFF3():
    in_file = "/var/www/localhost/NC_009925.gbk"
    out_file = "/var/www/localhost/output/your_file.gff"
    in_handle = open(in_file)
    out_handle = open(out_file, "w")

    GFF.write(SeqIO.parse(in_handle, "genbank"), out_handle)

    in_handle.close()
    out_handle.close()

convert_to_GFF3()

Here is part of the result:

##gff-version 3
##sequence-region NC_009925.1 1 6503724
NC_009925.1 annotation  remark  1   6503724 .   .   .   accessions=NC_009925;comment=PROVISIONAL REFSEQ: This record has not yet been subject to final%0ANCBI review. The reference sequence was derived from CP000828.%0ASource bacteria from Marine Biotechnology Institute Culture%0ACollection%2C Marine Biotechnology Institute%2C 3-75-1 Heita%2C Kamaishi%2C%0AIwate 026-0001%2C Japan.%0ACOMPLETENESS: full length.;data_file_division=CON;date=10-JUN-2013;gi=158333233;keywords=;organism=Acaryochloris marina MBIC11017;references=location: %5B0:6503724%5D%0Aauthors: Swingley%2CW.D.%2C Chen%2CM.%2C Cheung%2CP.C.%2C Conrad%2CA.L.%2C Dejesa%2CL.C.%2C Hao%2CJ.%2C Honchak%2CB.M.%2C Karbach%2CL.E.%2C Kurdoglu%2CA.%2C Lahiri%2CS.%2C Mastrian%2CS.D.%2C Miyashita%2CH.%2C Page%2CL.%2C Ramakrishna%2CP.%2C Satoh%2CS.%2C Sattley%2CW.M.%2C Shimada%2CY.%2C Taylor%2CH.L.%2C Tomo%2CT.%2C Tsuchiya%2CT.%2C Wang%2CZ.T.%2C Raymond%2CJ.%2C Mimuro%2CM.%2C Blankenship%2CR.E. and Touchman%2CJ.W.%0Atitle: Niche adaptation and genome expansion in the chlorophyll d-producing cyanobacterium Acaryochloris marina%0Ajournal: Proc. Natl. Acad. Sci. U.S.A. 105 %286%29%2C 2005-2010 %282008%29%0Amedline id: %0Apubmed id: 18252824%0Acomment:,location: %5B0:6503724%5D%0Aauthors: %0Aconsrtm: NCBI Genome Project%0Atitle: Direct Submission%0Ajournal: Submitted %2817-OCT-2007%29 National Center for Biotechnology Information%2C NIH%2C Bethesda%2C MD 20894%2C USA%0Amedline id: %0Apubmed id: %0Acomment:,location: %5B0:6503724%5D%0Aauthors: Touchman%2CJ.W.%0Atitle: Direct Submission%0Ajournal: Submitted %2827-AUG-2007%29 Pharmaceutical Genomics Division%2C Translational Genomics Research Institute%2C 13208 E Shea Blvd%2C Scottsdale%2C AZ 85004%2C USA%0Amedline id: %0Apubmed id: %0Acomment:;sequence_version=1;source=Acaryochloris marina MBIC11017;taxonomy=Bacteria,Cyanobacteria,Oscillatoriophycideae,Chroococcales,Acaryochloris
NC_009925.1    feature  source  1   6503724 .   +   .   db_xref=taxon:329726;mol_type=genomic DNA;note=type strain of Acaryochloris marina;organism=Acaryochloris marina MBIC11017;strain=MBIC11017
NC_009925.1    feature  gene    931 1581    .   -   .   db_xref=GeneID:5685235;locus_tag=AM1_0001;note=conserved hypothetical protein;pseudo=
NC_009925.1    feature  gene    1627    2319    .   -   .   db_xref=GeneID:5678840;locus_tag=AM1_0003

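The problem is with the third and fourth lines: the writer takes the complete header information from the .gbk file and puts it in as a single line, whereas it should skip it. The last two lines are correct (as is the rest of the output file). I have tried several different .gbk files and they all produce the same result.

For reference, this is the beginning of the .gbk file: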

LOCUS       NC_009925            6503724 bp    DNA     circular CON 10-JUN-2013
DEFINITION  Acaryochloris marina MBIC11017 chromosome, complete genome.
ACCESSION   NC_009925
VERSION     NC_009925.1  GI:158333233
DBLINK      Project: 58167
            BioProject: PRJNA58167
KEYWORDS    .
SOURCE      Acaryochloris marina MBIC11017
  ORGANISM  Acaryochloris marina MBIC11017
            Bacteria; Cyanobacteria; Oscillatoriophycideae; Chroococcales;
            Acaryochloris.
REFERENCE   1  (bases 1 to 6503724)
  AUTHORS   Swingley,W.D., Chen,M., Cheung,P.C., Conrad,A.L., Dejesa,L.C.,
            Hao,J., Honchak,B.M., Karbach,L.E., Kurdoglu,A., Lahiri,S.,
            Mastrian,S.D., Miyashita,H., Page,L., Ramakrishna,P., Satoh,S.,
            Sattley,W.M., Shimada,Y., Taylor,H.L., Tomo,T., Tsuchiya,T.,
            Wang,Z.T., Raymond,J., Mimuro,M., Blankenship,R.E. and
            Touchman,J.W.
  TITLE     Niche adaptation and genome expansion in the chlorophyll
            d-producing cyanobacterium Acaryochloris marina
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 105 (6), 2005-2010 (2008)
   PUBMED   18252824
REFERENCE   2  (bases 1 to 6503724)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (17-OCT-2007) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   3  (bases 1 to 6503724)
  AUTHORS   Touchman,J.W.
  TITLE     Direct Submission
  JOURNAL   Submitted (27-AUG-2007) Pharmaceutical Genomics Division,
            Translational Genomics Research Institute, 13208 E Shea Blvd,
            Scottsdale, AZ 85004, USA
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from CP000828.
            Source bacteria from Marine Biotechnology Institute Culture
            Collection, Marine Biotechnology Institute, 3-75-1 Heita, Kamaishi,
            Iwate 026-0001, Japan.
            COMPLETENESS: full length.
FEATURES             Location/Qualifiers
     source          1..6503724
                     /organism="Acaryochloris marina MBIC11017"
                     /mol_type="genomic DNA"
                     /strain="MBIC11017"
                     /db_xref="taxon:329726"
                     /note="type strain of Acaryochloris marina"
     gene            complement(931..1581)
                     /locus_tag="AM1_0001"
                     /note="conserved hypothetical protein"
                     /pseudo
                     /db_xref="GeneID:5685235"
     gene            complement(1627..2319)
                     /locus_tag="AM1_0003"
                     /db_xref="GeneID:5678840"
     CDS             complement(1627..2319)
                     /locus_tag="AM1_0003"
                     /codon_start=1
                     /transl_table=11
                     /product="NUDIX hydrolase"
                     /protein_id="YP_001514406.1"
                     /db_xref="GI:158333234"
                     /db_xref="GeneID:5678840"
                     /translation="MPYTYDYPRPGLTVDCVVFGLDEQIDLKVLLIQRQIPPFQHQWA
                 LPGGFVQMDESLEDAARRELREETGVQGIFLEQLYTFGDLGRDPRDRIISVAYYALIN
                 LIEYPLQASTDAEDAAWYSIENLPSLAFDHAQILKQAIRRLQGKVRYEPIGFELLPQK
                 FTLTQIQQLYETVLGHPLDKRNFRKKLLKMDLLIPLDEQQTGVAHRAARLYQFDQSKY
                 ELLKQQGFNFEV"

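Does anyone know how to fix this?

For now I use the following condition to filter out the two incorrect lines:

if "\tannotation\t" in line or "feature\tsource" in line:

This seems to work for several test .gbk files. But I am still curious why it writes those lines in the first place.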
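The filtering workaround above can be sketched a little more strictly by splitting each line on tabs and checking columns 2 and 3 rather than searching for substrings. This is only a sketch: the helper names `is_header_line` and `drop_header_lines` are hypothetical, and it assumes the output uses the column layout shown in the sample (`annotation` or `feature` in column 2, the feature type in column 3).

```python
def is_header_line(line):
    """Return True for the record-level 'annotation' row and the 'source' feature row."""
    # Directives such as "##gff-version 3" are always kept
    if line.startswith("#"):
        return False
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 3:
        return False
    source, feature_type = columns[1], columns[2]
    # Column 2 is "annotation" for the header row; the source feature
    # has "feature" in column 2 and "source" in column 3
    return source == "annotation" or (source == "feature" and feature_type == "source")


def drop_header_lines(lines):
    """Return only the lines that are not header-derived rows."""
    return [line for line in lines if not is_header_line(line)]
```

Comparing whole columns avoids accidentally dropping a feature whose attributes happen to contain the text "feature" followed by a tab and "source".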

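1 Answer:

Answer 0: (score: 1)

The answer is in the wiki page you linked (http://biopython.org/wiki/GFF_Parsing#Writing_GFF3): "The GFF3Writer takes an iterator of SeqRecord objects, and writes each SeqFeature as a GFF3 line". The SeqRecord object parsed from the .gbk file contains these annotations, so the writer writes them out. In the implementation (https://github.com/chapmanb/bcbb/blob/master/gff/BCBio/GFF/GFFOutput.py) you can see where this is done: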

self._write_annotations(rec.annotations, rec.id, len(rec.seq), out_handle)
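To illustrate why the whole header ends up as one long, percent-escaped line, here is a rough approximation of how a dictionary of annotations could be flattened into a single GFF3 attribute column. This is not BCBio's actual code: `format_attributes` is a hypothetical helper, and the set of unescaped characters is chosen only so that the result resembles the escapes visible in the output above (`%2C` for commas, `%0A` for newlines).

```python
from urllib.parse import quote


def format_attributes(annotations):
    """Flatten a dict of annotations into one GFF3 column-9 string."""
    pairs = []
    for key in sorted(annotations):
        value = annotations[key]
        # List values (e.g. taxonomy) are escaped one by one, then comma-joined
        values = value if isinstance(value, (list, tuple)) else [value]
        escaped = [quote(str(v).strip(), safe=":/ ") for v in values]
        pairs.append("%s=%s" % (key, ",".join(escaped)))
    # Key/value pairs are separated by semicolons, as in GFF3
    return ";".join(pairs)
```

With a multi-line COMMENT block as one of the values, every newline and comma is escaped, so the entire header collapses into a single row spanning the full sequence.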

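You can also see why the source feature passes through: it is handled like any other feature (gene, CDS) rather than being treated separately.

I don't know why there is no option or parameter (at least I didn't find one) to tell the writer to skip the annotations. I am also not aware of any parameter for skipping annotations when reading the SeqRecords with SeqIO.parse().

To solve your problem, I would go through the parsed SeqRecords one by one and remove the annotations and the source feature. One drawback of this approach is the extra memory it needs (plus a performance penalty), because I build a list from the initial generator. At the end I simply write the list out as GFF. I don't know whether this approach is much better than yours.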

#!/usr/bin/env python
from BCBio import GFF
from Bio import SeqIO

def convert_to_GFF3():
    in_file = "input.gbk"
    out_file = "output.gff"

    # context managers close both files even if parsing fails
    with open(in_file) as in_handle, open(out_file, "w") as out_handle:
        records = []
        for record in SeqIO.parse(in_handle, "genbank"):
            # delete annotations
            record.annotations = {}
            # loop through features to find the source
            for i, feature in enumerate(record.features):
                # if found, delete it and stop (only expect one source)
                if feature.type == "source":
                    record.features.pop(i)
                    break
            records.append(record)

        GFF.write(records, out_handle)

convert_to_GFF3()
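
Since the wiki says the writer takes an iterator of SeqRecord objects, and the code in the question already passes the SeqIO.parse() generator straight to GFF.write(), the memory drawback mentioned above can probably be avoided by cleaning the records lazily. The following is a sketch under that assumption: `strip_header` is a hypothetical helper, and the Biopython and BCBio imports are placed inside the function only so that the helper can be used on its own.

```python
def strip_header(records):
    """Yield each record with annotations cleared and 'source' features removed."""
    for record in records:
        record.annotations = {}
        # Keep every feature except the 'source' one(s)
        record.features = [f for f in record.features if f.type != "source"]
        yield record


def convert_to_gff3(in_file, out_file):
    # Imported here so strip_header stays usable without these packages
    from BCBio import GFF
    from Bio import SeqIO

    with open(in_file) as in_handle, open(out_file, "w") as out_handle:
        # Records are cleaned one at a time as the writer consumes them
        GFF.write(strip_header(SeqIO.parse(in_handle, "genbank")), out_handle)
```

Calling `convert_to_gff3("input.gbk", "output.gff")` would then hold only one record in memory at a time instead of the whole list.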