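Fetching a genome from NCBI with Biopython

Time: 2019-03-13 22:50:10

Tags: python bioinformatics biopython genome

Python newbie here. I want to use Biopython's Entrez and SeqIO modules to download the complete genome sequence for accession NC_007779.1. So far I have the following code: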

from Bio import Entrez
from Bio import SeqIO
Entrez.email = "me@alsome.org"
handle = Entrez.efetch(db="nuccore", id="NC_007779.1", rettype="gb", retmode="text")
genome = SeqIO.read(handle, "genbank")
print(genome)

But I get nothing back. Any help would be greatly appreciated.

Thanks in advance!

1 answer:

Answer 0 (score: 0)

Your code returns a result for me:

def surf_entrez():
    from Bio import Entrez
    from Bio import SeqIO

    Entrez.email = "me@alsome.org"
    handle = Entrez.efetch(db="nuccore", id="NC_007779.1", rettype="gb", retmode="text")
    genome = SeqIO.read(handle, "genbank")
    print(genome)

surf_entrez()

# RESULT
#/sequence_version=1
#/organism=Escherichia coli str. K-12 substr. W3110
#/data_file_division=CON
#/structured_comment=OrderedDict([('Genome-Annotation-Data', OrderedDict([('Annotation Provider', 'NCBI'), ('Annotation Date', '02/22/2017 01:34:58'), ('Annotation Pipeline', 'NCBI Prokaryotic Genome'), ('Annotation Method', 'Best-placed reference protein'), ('Annotation Software revision', '4.1'), ('Features Annotated', 'Gene; CDS; rRNA; tRNA; ncRNA;'), ('Genes (total)', '4,793'), ('CDS (total)', '4,671'), ('Genes (coding)', '4,471'), ('CDS (coding)', '4,471'), ('Genes (RNA)', '122'), ('rRNAs', '8, 7, 7 (5S, 16S, 23S)'), ('complete rRNAs', '8, 7, 7 (5S, 16S, 23S)'), ('tRNAs', '87'), ('ncRNAs', '13'), ('Pseudo Genes (total)', '200'), ('Pseudo Genes (ambiguous residues)', '0 of 200'), ('Pseudo Genes (frameshifted)', '99 of 200'), ('Pseudo Genes (incomplete)', '77 of 200'), ('Pseudo Genes (internal stop)', '66 of 200'), ('Pseudo Genes (multiple problems)', '38 of 200'), ('CRISPR Arrays', '2')]))])
#/date=22-FEB-2017
#/topology=circular
#/taxonomy=['Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacterales', 
#'Enterobacteriaceae', 'Escherichia']
#/keywords=['RefSeq']
#/contig=join(AP009048.1:1..4646332)
#/accessions=['NC_007779', 'NZ_AB001340', 'NZ_D10483', 'NZ_D26562', 'NZ_D83536', 
#'NZ_D90699-D90711', 'NZ_D90713-D90754', 'NZ_D90756-D90878', 'NZ_D90880-D90897']
#UnknownSeq(4646332, alphabet=IUPACAmbiguousDNA(), character='N')

Does that look right?
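One detail in the output above may explain the "nothing" in the question: the data file division is CON, the record carries a `contig=join(...)` entry, and the sequence is an `UnknownSeq` made of `N` characters. That suggests the record came back as a contig reference without the actual bases. A minimal sketch of one possible workaround follows, assuming NCBI's `gbwithparts` return type expands the contig into the full sequence for this accession. The names `efetch_params` and `fetch_genome` are hypothetical helpers for illustration, not part of Biopython.

```python
def efetch_params(accession, with_sequence=True):
    """Build keyword arguments for Entrez.efetch (hypothetical helper)."""
    # Assumption: "gbwithparts" asks NCBI to include the full sequence,
    # whereas plain "gb" may return only the contig join(...) reference.
    rettype = "gbwithparts" if with_sequence else "gb"
    return {"db": "nuccore", "id": accession, "rettype": rettype, "retmode": "text"}


def fetch_genome(accession, email):
    """Download one GenBank record with its sequence (needs network access)."""
    # Imports live inside the function, mirroring the answer's style.
    from Bio import Entrez
    from Bio import SeqIO

    Entrez.email = email
    handle = Entrez.efetch(**efetch_params(accession))
    try:
        # SeqIO.read expects exactly one record in the handle.
        return SeqIO.read(handle, "genbank")
    finally:
        handle.close()
```

With this sketch, `genome = fetch_genome("NC_007779.1", "me@alsome.org")` followed by `print(genome.seq[:60])` would be expected to show real bases rather than `N`s, provided the assumption about `gbwithparts` holds. If only the bases are needed, requesting `rettype="fasta"` and parsing with `"fasta"` is another option worth trying.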

You can also use SeqIO.parse, which iterates over every record in the handle:

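from Bio import Entrez
from Bio import SeqIO

Entrez.email = "me@alsome.org"
handle = Entrez.efetch(db="nuccore", id="U49845", rettype="gb", retmode="text")
# SeqIO.parse returns an iterator over the records in the response
for record in SeqIO.parse(handle, "genbank"):
    print(record.id, len(record))
    print(record)
handle.close()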

The printed record looks similar to the GenBank file format.
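Where `SeqIO.parse` seems most useful is when a single request returns several records, since `SeqIO.read` raises a `ValueError` unless the handle holds exactly one. A rough sketch, assuming the accessions are sent to `Entrez.efetch` as one comma-separated `id` string. `join_ids` and `fetch_records` are hypothetical names introduced here, not Biopython functions.

```python
def join_ids(accessions):
    """Join accession strings into one comma-separated id (hypothetical helper)."""
    return ",".join(a.strip() for a in accessions)


def fetch_records(accessions, email):
    """Fetch several GenBank records in one request (needs network access)."""
    from Bio import Entrez
    from Bio import SeqIO

    Entrez.email = email
    handle = Entrez.efetch(db="nuccore", id=join_ids(accessions),
                           rettype="gb", retmode="text")
    try:
        # Materialize the iterator before the handle is closed.
        return list(SeqIO.parse(handle, "genbank"))
    finally:
        handle.close()
```

A call such as `fetch_records(["NC_007779.1", "U49845"], "me@alsome.org")` would then be expected to return a list of two records, assuming both accessions resolve.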