Extracting part of a FASTA sequence based on bp coordinates

Time: 2016-08-31 12:02:26

Tags: python bioinformatics biopython fasta

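I have a huge FASTA file, but I only need to extract part of it, given the start and end base-pair coordinates of my sequence. The output should also be in FASTA format, with 60 bp per line. Here is my attempt; please let me know if it looks right. Any suggestions for improvement are welcome.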

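from Bio import SeqIO

line_width = 60
# Context managers close both files even if an error occurs
with open('full_chr.fa') as inFile, open('part.fa', 'w') as fw:
    for record in SeqIO.parse(inFile, 'fasta'):
        fw.write(">" + record.id + "\n")
        # Slice indices are 0-based and half-open; no trailing '\n' here, or a blank line is written
        fww = str(record.seq[600130000:602000000])
        # range replaces the Python 2-only xrange
        for i in range(0, len(fww), line_width):
            fw.write(fww[i:i+line_width] + '\n')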

1 answer:

Answer 0 (score: 3)

It's as easy as this (note that SeqIO.read expects the file to contain exactly one record):

from Bio import SeqIO


record = SeqIO.read("Chromosome.fas", "fasta")

with open("output.fas", "w") as out:
    SeqIO.write(record[100:500], out, "fasta")
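
One thing to watch with the slice above: Python slices are 0-based and half-open, whereas bp coordinates are often quoted as 1-based and inclusive. Below is a minimal sketch of the conversion, assuming your coordinates are 1-based inclusive. The helper name bp_slice is hypothetical and not part of Biopython, and a plain string stands in for the sequence so the example runs without any input file.

```python
def bp_slice(start_bp, end_bp):
    """Convert 1-based inclusive bp coordinates to a Python slice (hypothetical helper)."""
    if start_bp < 1 or end_bp < start_bp:
        raise ValueError("expected 1 <= start_bp <= end_bp")
    # 1-based inclusive [start, end] maps to 0-based half-open [start - 1, end)
    return slice(start_bp - 1, end_bp)


# A plain string stands in for the sequence; the same indices apply to a SeqRecord.
seq = "ACGTACGTAC"
print(seq[bp_slice(3, 6)])  # bases 3 through 6: GTAC
```

Under that assumption, record[bp_slice(101, 500)] selects the same bases as record[100:500].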

SeqIO.write already wraps sequence lines at 60 characters. To change the line width, use the FastaWriter object. Here is an example with 80 bp lines:

from Bio import SeqIO
from Bio.SeqIO.FastaIO import FastaWriter


record = SeqIO.read("Chromosome.fas", "fasta")

with open("output.fas", "w") as out:
    writer = FastaWriter(out, wrap=80)
    writer.write_header()
    writer.write_record(record[100:500])
    writer.write_footer()
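
To make the wrapping itself concrete, here is a rough plain-Python sketch of what line wrapping amounts to. It illustrates the idea only and is not Biopython's implementation; the function names wrap_sequence and format_fasta are hypothetical.

```python
def wrap_sequence(seq, width=60):
    """Split a sequence string into chunks of at most `width` characters."""
    if width < 1:
        raise ValueError("width must be at least 1")
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def format_fasta(header, seq, width=60):
    """Build one FASTA entry: a '>' header line followed by wrapped sequence lines."""
    lines = [">" + header] + wrap_sequence(seq, width)
    return "\n".join(lines) + "\n"


# Toy 20 bp sequence wrapped at 8 characters per line
print(format_fasta("chr_part", "ACGT" * 5, width=8), end="")
```

Every line except possibly the last has exactly width characters, and no blank line follows the entry.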